# SQL Data Cleaning for Mobile Coverage Dataset

### Purpose:

This SQL code is designed to clean and prepare a public dataset of mobile coverage data in Google BigQuery.

The primary goal is data cleaning, which includes creating a new table, normalizing categorical features, handling missing data, and removing outliers from numerical features.

This process is essential for ensuring data quality and reliability for subsequent analysis.

### Import libraries and modules

In [1]:
import pandas as pd
from google.cloud import bigquery

### Import function: Interactive SQL Query to Pandas DataFrame Converter

In [2]:
# Import the custom query_df and run_query functions for executing BigQuery queries
from query_functions import query_df  # Execute the query and return the output as a DataFrame
from query_functions import run_query  # Execute the query without returning a DataFrame, used for INSERT, UPDATE, DELETE, etc.

### Datasets and Tables paths to Google BigQuery

In [3]:
# Catalonian mobile coverage eu (2015-2017) --> mobile_data_2015_2017_cleaned
mobile_data_cleaned = "bq-analyst-230590.project_cat_mobile_coverage_2015_2017.mobile_data_2015_2017_cleaned"

### Creating Table and Columns

This following code creates a **copy of the 'mobile_data_2015_2017' table** by selecting specific columns and saves it as 'mobile_data_2015_2017_cleaned

In [4]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
CREATE TABLE IF NOT EXISTS `{mobile_data_cleaned}` AS
SELECT
    date,
    hour,
    lat,
    long,
    signal,
    network,
    operator,
    status,
    description,
    net,
    speed,
    satellites,
    precission,
    provider,
    activity,
    downloadSpeed,
    uploadSpeed,
    postal_code,
    town_name,
    position_geom
FROM
  `bigquery-public-data.catalonian_mobile_coverage_eu.mobile_data_2015_2017`
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Preview

In [5]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT *
FROM `{mobile_data_cleaned}` 
WHERE
    network IS NOT NULL
    AND postal_code IS NOT NULL
LIMIT 10
    """

# Execute the query
raw_data = query_df(query)

# Display data
raw_data

Unnamed: 0,date,hour,lat,long,signal,network,operator,status,description,net,speed,satellites,precission,provider,activity,downloadSpeed,uploadSpeed,postal_code,town_name,position_geom
0,2015-09-16,12:44:35,42.36216,1.87294,22,movistar,ONO,2,STATE_EMERGENCY_ONLY,3G,71.2,9.0,24.0,gps,IN_VEHICLE,,,170617,Das,POINT(1.87294 42.36216)
1,2015-04-01,19:14:50,41.93302,2.24372,17,movistar,ONO,2,STATE_EMERGENCY_ONLY,2G,13.4,9.0,20.0,gps,IN_VEHICLE,,,82981,Vic,POINT(2.24372 41.93302)
2,2016-05-27,16:36:47,42.16972,2.48033,27,vodafone,ONO,2,STATE_EMERGENCY_ONLY,3G,9.8,5.0,30.0,gps,ON_FOOT,,,171143,Olot,POINT(2.48033 42.16972)
3,2016-06-02,12:37:49,42.1723,2.47677,10,vodafone,ONO,2,STATE_EMERGENCY_ONLY,3G,0.2,0.0,25.0,gps,STILL,,,171143,Olot,POINT(2.47677 42.1723)
4,2016-06-02,09:09:56,42.18748,2.47931,15,vodafone,ONO,2,STATE_EMERGENCY_ONLY,3G,25.9,7.0,15.0,gps,IN_VEHICLE,,,171143,Olot,POINT(2.47931 42.18748)
5,2016-06-03,07:13:23,41.98194,2.7812,12,vodafone,ONO,2,STATE_EMERGENCY_ONLY,3G,114.5,4.0,20.0,gps,IN_VEHICLE,,,171557,Salt,POINT(2.7812 41.98194)
6,2015-08-26,10:33:26,42.07678,1.8132,8,movistar,ONO,2,STATE_EMERGENCY_ONLY,3G,73.0,8.0,9.0,gps,IN_VEHICLE,,,80116,Avià,POINT(1.8132 42.07678)
7,2015-09-16,14:51:45,42.09861,1.84183,12,movistar,ONO,2,STATE_EMERGENCY_ONLY,3G,0.1,3.0,20.0,fused,ON_FOOT,,,80229,Berga,POINT(1.84183 42.09861)
8,2015-09-16,13:19:15,42.14354,1.86323,11,movistar,ONO,2,STATE_EMERGENCY_ONLY,3G,66.7,7.0,35.0,gps,IN_VEHICLE,,,82687,Cercs,POINT(1.86323 42.14354)
9,2015-08-27,11:19:43,42.1272,1.86118,15,movistar,ONO,2,STATE_EMERGENCY_ONLY,2G,64.4,10.0,69.0,gps,IN_VEHICLE,,,82687,Cercs,POINT(1.86118 42.1272)


In [6]:
raw_data.columns

Index(['date', 'hour', 'lat', 'long', 'signal', 'network', 'operator',
       'status', 'description', 'net', 'speed', 'satellites', 'precission',
       'provider', 'activity', 'downloadSpeed', 'uploadSpeed', 'postal_code',
       'town_name', 'position_geom'],
      dtype='object')

    - Add 'province' new column

A new column, **province** is being added to the BigQuery dataset to include province values, enhancing the dataset for regional analysis.

This new column is currently empty and will be populated based on the structure of Spanish **postal codes**, with the first two digits determining the province name. If the 'postal_code' doesn't match any of these values, it assigns 'Not defined'. Existing rows with pre-defined province values will not be affected by this update.

In [7]:
# Datasets: {mobile_data_cleaned}

# SQL query: create and add 'province' column
query = f"""
ALTER TABLE `{mobile_data_cleaned}`
ADD COLUMN province STRING;
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Populate 'province' values

In [8]:
# Datasets: {mobile_data_cleaned}

# SQL query: populate 'province' values
query = f"""
UPDATE `{mobile_data_cleaned}`
SET province = CASE
  # First two digits condition (LEFT) for postal code using CAST to maintain a consistent string data type.
  WHEN LEFT(CAST(postal_code AS STRING), 2) = '08' THEN 'Barcelona'
  WHEN LEFT(CAST(postal_code AS STRING), 2) = '25' THEN 'Lleida'
  WHEN LEFT(CAST(postal_code AS STRING), 2) = '17' THEN 'Girona'
  WHEN LEFT(CAST(postal_code AS STRING), 2) = '43' THEN 'Tarragona'
  ELSE 'Not defined'
END
WHERE province IS NULL
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Rename columns to avoid confusion and fix typos

In the original dataset, there were three key columns: 'net,' 'network,' and 'operator.' These columns contained information about the type of network (e.g., 4G, 3G, 2G), the network provider (e.g., Movistar, Orange), and the specific operator (e.g., Movistar, Orange, ONO, Lowi) associated with the data.

To avoid potential confusion, especially between the 'net' and 'network' columns, the following SQL code is designed to **rename the 'network' column to 'net_provider.'** This renaming ensures that the purpose and meaning of each column are clear and helps improve the overall clarity of the dataset.

In [9]:
# Datasets: {mobile_data_cleaned}

# SQL query: rename 'network' for 'net_provider'
query = f"""
ALTER TABLE `{mobile_data_cleaned}`
RENAME COLUMN network TO net_provider
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


In [10]:
# Datasets: {mobile_data_cleaned}

# SQL query: rename 'network' for 'net_provider'
query = f"""
ALTER TABLE `{mobile_data_cleaned}`
RENAME COLUMN precission TO precision
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Column types

In [11]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           10 non-null     object 
 1   hour           10 non-null     object 
 2   lat            10 non-null     float64
 3   long           10 non-null     float64
 4   signal         10 non-null     int64  
 5   network        10 non-null     object 
 6   operator       10 non-null     object 
 7   status         10 non-null     int64  
 8   description    10 non-null     object 
 9   net            10 non-null     object 
 10  speed          10 non-null     float64
 11  satellites     10 non-null     float64
 12  precission     10 non-null     float64
 13  provider       10 non-null     object 
 14  activity       10 non-null     object 
 15  downloadSpeed  0 non-null      object 
 16  uploadSpeed    0 non-null      object 
 17  postal_code    10 non-null     object 
 18  town_name    

In this case, '**date**' and '**hour**' have been transformed to 'object' types. However, in the original Google BigQuery dataset, their formats are 'date' and 'time,' respectively. 

Just in case you need to update these formats in the original table in BigQuery, here are the required queries.

(Please note that we are displaying this data as a Python DataFrame for reference, but the actual changes must be made in the original BigQuery dataset as we will be directly querying it. We won't be working with this DataFrame for further analysis.)

In [12]:
# SQL query: in BigQuery,convert 'date' to date format and 'hour' to time format
query = f"""
    # Convert 'date' column to date data type
    UPDATE `{mobile_data_cleaned}`
    SET date = CAST(date AS DATE)
    """
# Execute the query
# run_query(query)


query = f"""
    # Convert 'date' column to date data type
    UPDATE `{mobile_data_cleaned}`
    SET hour = CAST(hour AS TIME)
    """

# Execute the query
# run_query(query)

### Categorical Features

#### Standardize data names

**Net provider**

In [13]:
# Datasets: {mobile_data_cleaned}

# SQL query: original count of unique net_provider
query = f"""
SELECT
 COUNT(DISTINCT(net_provider)) original_unique_net_provider
FROM `{mobile_data_cleaned}`
    """

# Execute the query
query_df(query)

Unnamed: 0,original_unique_net_provider
0,250


In [14]:
# Datasets: {mobile_data_cleaned}

# SQL query: example of net_provider names not standardized (Movistar)
query = f"""
SELECT
 DISTINCT(net_provider)
FROM `{mobile_data_cleaned}`
WHERE
 UPPER(net_provider) LIKE 'MO%IS%'
    """

# Execute the query
query_df(query)

Unnamed: 0,net_provider
0,movistar
1,Mobistar
2,Movistar | Particular
3,Movistar | Empresa


    - Unusual characters

In [15]:
# Datasets: {mobile_data_cleaned}

# SQL query: unusual characters to take into account
query = f"""
SELECT DISTINCT net_provider
FROM `{mobile_data_cleaned}`
# Filter out net_provider values that contain non-alphanumeric characters (letters, digits, and spaces).
WHERE REGEXP_CONTAINS(net_provider, r'[^a-zA-Z0-9 ]')
    """

# Execute the query
query_df(query)

Unnamed: 0,net_provider
0,T-Mobile NL
1,Orange F | Orange
2,VodaCom-MZ
3,T-Mobile A | Orange
4,Buscando servi├ºo
5,TIM@sea
6,CHN-UNICOM
7,Telekom.de
8,EE | Orange
9,France Telecom Espa├▒a SA


This following code standardizes the 'net_provider' names in the table. It employs various functions like UPPER (to convert to uppercase), TRIM (to remove leading/trailing spaces), and '=' or LIKE (with % wildcard for pattern matching) to recognize and group similar operator names under a common name.

For example, 'movistar' and 'Mobistar' are both categorized as 'Movistar'. It ensures uniformity in the 'net_provider' column, making the data more consistent and easier to work with.

    - Standardize net_provider names

In [16]:
# Datasets: {mobile_data_cleaned}

# SQL query: Update net_provider names
query = f"""
UPDATE `{mobile_data_cleaned}`
SET net_provider = CASE
    # WHEN UPPER(TRIM(net_provider)) = '3' THEN 'Three'
    WHEN UPPER(TRIM(net_provider)) LIKE '3%' THEN 'Three'
    #WHEN UPPER(TRIM(net_provider)) = 'BITEL' THEN 'Bytel'
    WHEN UPPER(TRIM(net_provider)) LIKE '%AIRTEL%' THEN 'Airtel'
    WHEN UPPER(TRIM(net_provider)) LIKE '%B%TEL%' THEN 'Bytel'
    WHEN UPPER(TRIM(net_provider)) LIKE '%BOUYGUES%' THEN 'Bouygues Telecom'
    WHEN UPPER(TRIM(net_provider)) LIKE 'BUSCANDO %' THEN 'Sense Servei'
    WHEN UPPER(TRIM(net_provider)) = 'CABLE MOVIL' THEN 'Cable Movil'
    WHEN UPPER(TRIM(net_provider)) = 'CABLEMOVIL' THEN 'Cable Movil'
    WHEN UPPER(TRIM(net_provider)) LIKE 'CLARO%' THEN 'Claro'
    WHEN UPPER(TRIM(net_provider)) LIKE 'CUBACEL%' THEN 'Cubacel'
    WHEN UPPER(TRIM(net_provider)) LIKE 'E-%' THEN 'EE'
    WHEN UPPER(TRIM(net_provider)) LIKE '%EMER%' THEN 'Nomes Trucades Emergencies'
    WHEN UPPER(TRIM(net_provider)) LIKE 'JAZZTEL%' THEN 'Jazztel'
    WHEN UPPER(TRIM(net_provider)) LIKE 'LOWI%' THEN 'Lowi'
    WHEN LEFT(UPPER(TRIM(net_provider)), 14) = 'FRANCE TELECOM' THEN 'France Telcom Espana SA'
    WHEN UPPER(TRIM(net_provider)) LIKE 'MASMOVIL%' THEN 'Masmovil'
    WHEN UPPER(TRIM(net_provider)) LIKE '%MOBILAND%' THEN 'Mobiland'
    WHEN UPPER(TRIM(net_provider)) LIKE 'MOVILNET%' THEN 'Movilnet'
    WHEN UPPER(TRIM(net_provider)) LIKE '%MO%ISTAR%' THEN 'Movistar'
    # WHEN UPPER(TRIM(net_provider)) LIKE 'MOBISTAR%' THEN 'Movistar'
    # WHEN UPPER(TRIM(net_provider)) LIKE '%MOBISTAR%' THEN 'Movistar'
    # WHEN UPPER(TRIM(net_provider)) LIKE 'MOVISTAR%' THEN 'Movistar'
    WHEN UPPER(TRIM(net_provider)) LIKE 'MTS%' THEN 'MTS'
    WHEN UPPER(TRIM(net_provider)) LIKE '%ORANGE%' THEN 'Orange'
    WHEN UPPER(TRIM(net_provider)) = 'A1' THEN 'Orange'
    WHEN UPPER(TRIM(net_provider)) LIKE 'O2%' THEN 'O2'
    WHEN UPPER(TRIM(net_provider)) LIKE 'PE%EPHONE' THEN 'Pepephone'
    # WHEN UPPER(TRIM(net_provider)) = 'PELEPHONE' THEN 'Pepephone'
    # WHEN UPPER(TRIM(net_provider)) = 'PEPEPHONE' THEN 'Pepephone'
    WHEN UPPER(TRIM(net_provider)) = 'PROXIMUS' THEN 'Proximus'
    WHEN UPPER(TRIM(net_provider)) LIKE 'REPUBLICA%' THEN 'Republica Movil'
    WHEN UPPER(TRIM(net_provider)) LIKE 'SENSE %' THEN 'Sense Servei'
    WHEN UPPER(TRIM(net_provider)) LIKE 'SIMYO%' THEN 'Simyo'
    WHEN UPPER(TRIM(net_provider)) LIKE 'SIN %' THEN 'Sense Servei'
    WHEN UPPER(TRIM(net_provider)) LIKE 'TIGO%' THEN 'TIGO'
    WHEN UPPER(TRIM(net_provider)) LIKE 'TIM%' THEN 'TIM'
    WHEN UPPER(TRIM(net_provider)) LIKE 'TELEF%' THEN 'Movistar'
    # WHEN (TRIM(network)) = 'Telefonica Moviles Espana' THEN 'Movistar'
    # WHEN (TRIM(network)) = 'Telef├│nica M├│viles Espa├▒a' THEN 'Movistar'
    WHEN UPPER(TRIM(net_provider)) LIKE 'TDC%' THEN 'TDC Mobile'
    WHEN UPPER(TRIM(net_provider)) LIKE 'TELEKOM%' THEN 'Telekom'
    WHEN UPPER(TRIM(net_provider)) LIKE '%TELENOR%' THEN 'Telenor'
    WHEN UPPER(TRIM(net_provider)) LIKE '%T-MOBILE%' THEN 'T-Mobile'
    WHEN UPPER(TRIM(net_provider)) LIKE 'VIVO%' THEN 'Vivo'
    WHEN UPPER(TRIM(net_provider)) LIKE '%VODAFONE%' THEN 'Vodafone'
    WHEN UPPER(TRIM(net_provider)) = 'VF ES' THEN 'Vodafone'
    WHEN UPPER(TRIM(net_provider)) LIKE 'VODACOM%' THEN 'Vodacom'
    WHEN UPPER(TRIM(net_provider)) LIKE 'YOIGO%' THEN 'Yoigo'
    ELSE net_provider
END
WHERE net_provider IS NOT NULL;

    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Null values in net_provider

In [17]:
# Datasets: {mobile_data_cleaned}

# SQL query: Count null values and calculate the percentage
query = f"""
WITH NetProviderCount AS (
 SELECT
   COUNT(*) AS total_count_net_provider
 FROM `{mobile_data_cleaned}`
),
NullProviderCount AS (
 SELECT
   COUNT(*) AS null_count_net_provider
 FROM `{mobile_data_cleaned}`
 WHERE
   net_provider IS NULL
   OR net_provider = 'null'
)

SELECT
 null_count_net_provider,
 CONCAT(ROUND((null_count_net_provider/total_count_net_provider)*100, 2), " %") AS perc_null_net_provider
FROM NetProviderCount, NullProviderCount
"""

# Execute the query
query_df(query)

Unnamed: 0,null_count_net_provider,perc_null_net_provider
0,53284,0.45 %


     - Remove rows with specified net_provider values and NULL net_providers from the table.

In [18]:
# Datasets: {mobile_data_cleaned}

# SQL query: drop specific rows
query = f"""
DELETE FROM `{mobile_data_cleaned}`
WHERE
 net_provider IS NULL
 OR net_provider IN ('000000', '21303', '21401', '23866', '90118', '?????', 'null')
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Remove rows where the activity count for each net_provider is less than 10

In [19]:
# Datasets: {mobile_data_cleaned}

# SQL query: drop specific rows
query = f"""
DELETE FROM `{mobile_data_cleaned}`
WHERE net_provider IN (
    SELECT net_provider
    FROM `{mobile_data_cleaned}`
    GROUP BY net_provider
    HAVING COUNT(*) < 10
)
    """
# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


In [20]:
# Datasets: {mobile_data_cleaned}

# SQL query: final count of unique net_provider
query = f"""
SELECT
 COUNT(DISTINCT(net_provider)) final_unique_net_provider
FROM `{mobile_data_cleaned}`
    """

# Execute the query
query_df(query)

Unnamed: 0,final_unique_net_provider
0,122


In [21]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
 net_provider,
 COUNT(*) activity_count
FROM `{mobile_data_cleaned}`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20
"""

# Execute the query 
query_df(query)

Unnamed: 0,net_provider,activity_count
0,Movistar,5087640
1,Orange,2932386
2,Vodafone,2822286
3,Yoigo,376842
4,MetroPCS,95162
5,Eroski Movil,40041
6,Jazztel,28894
7,France Telcom Espana SA,28113
8,TICAE,26763
9,Lowi,21742


**Operator**

In [22]:
# Datasets: {mobile_data_cleaned}

# SQL query: original count of unique net_provider
query = f"""
SELECT
 COUNT(DISTINCT(operator)) original_unique_net_operator
FROM `{mobile_data_cleaned}`
    """

# Execute the query
query_df(query)

Unnamed: 0,original_unique_net_operator
0,243


    - Standardize operator names

In [23]:
# Datasets: {mobile_data_cleaned}

# SQL query: Update operator names
query = f"""
UPDATE `{mobile_data_cleaned}`
SET operator = CASE
    # WHEN UPPER(TRIM(operator)) = '3' THEN 'Three'
    WHEN UPPER(TRIM(operator)) LIKE '3%' THEN 'Three'
    # WHEN UPPER(TRIM(operator)) = 'BITEL' THEN 'Bytel'
    WHEN UPPER(TRIM(operator)) LIKE '%AIRTEL%' THEN 'Airtel'
    WHEN UPPER(TRIM(operator)) LIKE '%B%TEL%' THEN 'Bytel'
    WHEN UPPER(TRIM(operator)) LIKE '%BOUYGUES%' THEN 'Bouygues Telecom'
    WHEN UPPER(TRIM(operator)) LIKE 'BUSCANDO %' THEN 'Sense Servei'
    WHEN UPPER(TRIM(operator)) = 'CABLE MOVIL' THEN 'Cable Movil'
    WHEN UPPER(TRIM(operator)) = 'CABLEMOVIL' THEN 'Cable Movil'
    WHEN UPPER(TRIM(operator)) LIKE 'CLARO%' THEN 'Claro'
    WHEN UPPER(TRIM(operator)) LIKE 'CUBACEL%' THEN 'Cubacel'
    WHEN UPPER(TRIM(operator)) LIKE 'E-%' THEN 'EE'
    WHEN UPPER(TRIM(operator)) LIKE '%EMER%' THEN 'Nomes Trucades Emergencies'
    WHEN UPPER(TRIM(operator)) LIKE 'JAZZTEL%' THEN 'Jazztel'
    WHEN UPPER(TRIM(operator)) LIKE 'LOWI%' THEN 'Lowi'
    WHEN LEFT(UPPER(TRIM(operator)), 14) = 'FRANCE TELECOM' THEN 'France Telcom Espana SA'
    WHEN UPPER(TRIM(operator)) LIKE 'MASMOVIL%' THEN 'Masmovil'
    WHEN UPPER(TRIM(operator)) LIKE '%MOBILAND%' THEN 'Mobiland'
    WHEN UPPER(TRIM(operator)) LIKE 'MOVILNET%' THEN 'Movilnet'
    WHEN UPPER(TRIM(operator)) LIKE '%MO%ISTAR%' THEN 'Movistar'
    # WHEN UPPER(TRIM(operator)) LIKE 'MOBISTAR%' THEN 'Movistar'
    # WHEN UPPER(TRIM(operator)) LIKE '%MOBISTAR%' THEN 'Movistar'
    # WHEN UPPER(TRIM(operator)) LIKE 'MOVISTAR%' THEN 'Movistar'
    WHEN UPPER(TRIM(operator)) LIKE 'MTS%' THEN 'MTS'
    WHEN UPPER(TRIM(operator)) LIKE '%ORANGE%' THEN 'Orange'
    WHEN UPPER(TRIM(operator)) = 'A1' THEN 'Orange'
    WHEN UPPER(TRIM(operator)) LIKE 'O2%' THEN 'O2'
    WHEN UPPER(TRIM(operator)) LIKE 'PE%EPHONE' THEN 'Pepephone'
    # WHEN UPPER(TRIM(operator)) = 'PELEPHONE' THEN 'Pepephone'
    # WHEN UPPER(TRIM(operator)) = 'PEPEPHONE' THEN 'Pepephone'
    WHEN UPPER(TRIM(operator)) = 'PROXIMUS' THEN 'Proximus'
    WHEN UPPER(TRIM(operator)) LIKE 'REPUBLICA%' THEN 'Republica Movil'
    WHEN UPPER(TRIM(operator)) LIKE 'SENSE %' THEN 'Sense Servei'
    WHEN UPPER(TRIM(operator)) LIKE 'SIMYO%' THEN 'Simyo'
    WHEN UPPER(TRIM(operator)) LIKE 'SIN %' THEN 'Sense Servei'
    WHEN UPPER(TRIM(operator)) LIKE 'TIGO%' THEN 'TIGO'
    WHEN UPPER(TRIM(operator)) LIKE 'TIM%' THEN 'TIM'
    WHEN UPPER(TRIM(operator)) LIKE 'TELEF%' THEN 'Movistar'
    # WHEN (TRIM(network)) = 'Telefonica Moviles Espana' THEN 'Movistar'
    # WHEN (TRIM(network)) = 'Telef├│nica M├│viles Espa├▒a' THEN 'Movistar'
    WHEN UPPER(TRIM(operator)) LIKE 'TDC%' THEN 'TDC Mobile'
    WHEN UPPER(TRIM(operator)) LIKE 'TELEKOM%' THEN 'Telekom'
    WHEN UPPER(TRIM(operator)) LIKE '%TELENOR%' THEN 'Telenor'
    WHEN UPPER(TRIM(operator)) LIKE '%T-MOBILE%' THEN 'T-Mobile'
    WHEN UPPER(TRIM(operator)) LIKE 'VIVO%' THEN 'Vivo'
    WHEN UPPER(TRIM(operator)) LIKE '%VODAFONE%' THEN 'Vodafone'
    WHEN UPPER(TRIM(operator)) = 'VF ES' THEN 'Vodafone'
    WHEN UPPER(TRIM(operator)) LIKE 'VODACOM%' THEN 'Vodacom'
    WHEN UPPER(TRIM(operator)) LIKE 'YOIGO%' THEN 'Yoigo'
    ELSE operator
END
WHERE operator IS NOT NULL;
    """

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - Null values in operator

In [24]:
# Datasets: {mobile_data_cleaned}

# SQL query: Count null values and calculate the percentage
query = f"""
WITH NetOperatorCount AS (
 SELECT
   COUNT(*) AS total_count_operator
 FROM `{mobile_data_cleaned}`
),
NullOperatorCount AS (
 SELECT
   COUNT(*) AS null_count_operator
 FROM `{mobile_data_cleaned}`
 WHERE
   operator IS NULL
   OR operator = 'null'
)

SELECT
 null_count_operator,
 CONCAT(ROUND((null_count_operator/total_count_operator)*100, 2), " %") AS perc_null_operator
FROM NetOperatorCount, NullOperatorCount
"""

# Execute the query
query_df(query)

Unnamed: 0,null_count_operator,perc_null_operator
0,0,0 %


In [25]:
# Datasets: {mobile_data_cleaned}

# SQL query: final count of unique operator
query = f"""
SELECT
 COUNT(DISTINCT(net_provider)) final_unique_operator
FROM `{mobile_data_cleaned}`
    """

# Execute the query
query_df(query)

Unnamed: 0,final_unique_operator
0,122


In [26]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
 operator,
 COUNT(*) activity_count
FROM `{mobile_data_cleaned}`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20
"""

# Execute the query 
query_df(query)

Unnamed: 0,operator,activity_count
0,Movistar,4601611
1,Vodafone,2808951
2,Orange,2130386
3,Jazztel,465242
4,Pepephone,398089
5,Yoigo,376842
6,RACC,145208
7,Simyo,142167
8,MetroPCS,95162
9,PARLEM,48442


**Net**

In [27]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
 net,
 COUNT(*) activity_record
FROM `{mobile_data_cleaned}`
GROUP BY 1
"""

# Execute the query 
query_df(query)

Unnamed: 0,net,activity_record
0,4G,4532624
1,,763270
2,3G,4045652
3,2G,2349673


    - Replace null values for 'Undefined Net'

In [28]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
UPDATE `{mobile_data_cleaned}`
SET net = CASE
  WHEN net IS NULL THEN 'Undefined net'
  ELSE net
END
WHERE net IS NULL
"""

# Execute the query 
run_query(query)

Query successfully executed, and the table has been updated.


**Provider**

In [29]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
 provider,
 COUNT(*) activity_record
FROM `{mobile_data_cleaned}`
GROUP BY 1
"""

# Execute the query 
query_df(query)

Unnamed: 0,provider,activity_record
0,gps,10279854
1,GPS,1125
2,22,1
3,disabled,25
4,,3
5,2017-08-28 11:31:10.000000,1
6,fused,1409817
7,19,1
8,local_database,4
9,network,286


In [30]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
UPDATE `{mobile_data_cleaned}`
SET provider = CASE
    WHEN UPPER(TRIM(provider)) LIKE '%GPS%' THEN 'GPS'
    WHEN UPPER(TRIM(provider)) LIKE '%FUSED%' THEN 'Fused'
    WHEN UPPER(TRIM(provider)) LIKE '%NETWORK%' THEN 'Network'
    WHEN provider IS NULL THEN 'Undefined Provider'
    ELSE 'Undefined Provider'
END
WHERE provider IS NOT NULL
"""

# Execute the query 
run_query(query)

Query successfully executed, and the table has been updated.


In [31]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
 provider,
 COUNT(*) activity_record
FROM `{mobile_data_cleaned}`
GROUP BY 1
"""

# Execute the query 
query_df(query)

Unnamed: 0,provider,activity_record
0,Fused,1409817
1,GPS,10280980
2,,3
3,Undefined Provider,36
4,Network,383


**Postal Code** and **Town Names**

This part ensures that there are no discrepancies between postal codes and town names, and removes rows with missing values in both columns.

In [32]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 COUNT(*) one_valid_one_null
FROM `{mobile_data_cleaned}`
WHERE
    (postal_code IS NULL AND town_name IS NOT NULL)
    OR (postal_code IS NOT NULL AND town_name IS NULL)
"""

# Execute the query
query_df(query)

Unnamed: 0,one_valid_one_null
0,0


In [33]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 COUNT(*) both_features_null
FROM `{mobile_data_cleaned}`
WHERE
    postal_code IS NULL AND town_name IS NULL
"""

# Execute the query
query_df(query)

Unnamed: 0,both_features_null
0,926965


    - This means that all null values in postal_code correspond to null values in `town_name. Therefore, we are going to drop all rows with null values in both features.

In [34]:
# Datasets: {mobile_data_cleaned}

# SQL query: delete specific rows
query = f"""
DELETE FROM `{mobile_data_cleaned}`
WHERE postal_code IS NULL AND town_name IS NULL
"""

# Execute the query 
run_query(query)

Query successfully executed, and the table has been updated.


**Download Speed** and **Upload Speed**

In [35]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 DISTINCT(downloadSpeed)
FROM `{mobile_data_cleaned}`
LIMIT 60
"""

# Execute the query
query_df(query)

Unnamed: 0,downloadSpeed
0,


In [36]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 DISTINCT(uploadSpeed)
FROM `{mobile_data_cleaned}`
LIMIT 60
"""

# Execute the query
query_df(query)

Unnamed: 0,uploadSpeed
0,


    - Both columns have no recorded data and only contain null values. Since they do not provide any information, we are going to drop both columns.

In [37]:
# Datasets: {mobile_data_cleaned}

# SQL query: delete specific rows
query = f"""
ALTER TABLE `{mobile_data_cleaned}`
DROP COLUMN downloadSpeed,
DROP COLUMN uploadSpeed
"""

# Execute the query 
run_query(query)

Query successfully executed, and the table has been updated.


**Description** and **Activity**

In [38]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 description,
 COUNT(*) activity_record
FROM `{mobile_data_cleaned}`
GROUP BY 1
LIMIT 60
"""

# Execute the query
query_df(query)

Unnamed: 0,description,activity_record
0,STATE_POWER_OFF,26407
1,STATE_EMERGENCY_ONLY,9586391
2,STATE_IN_SERVICE,1151314
3,STATE_OUT_OF_SERVICE,142


In [39]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 activity,
 COUNT(*) activity_record
FROM `{mobile_data_cleaned}`
GROUP BY 1
LIMIT 60
"""

# Execute the query
query_df(query)

Unnamed: 0,activity,activity_record
0,UNKNOWN,699712
1,ON_FOOT,2065263
2,STILL,1355843
3,IN_VEHICLE,5349968
4,ON_BICYCLE,200495
5,,32
6,TILTING,1092941


    - Update null values as 'UNKNOWN' activity

In [40]:
# Datasets: {mobile_data_cleaned}

# SQL query: update NULL values
query = f"""
UPDATE `{mobile_data_cleaned}`
SET activity = CASE 
  WHEN activity IS NULL THEN 'UNKNOWN' 
END
WHERE activity IS NULL
"""

# Execute the query 
run_query(query)

Query successfully executed, and the table has been updated.


### Numerical Features

#### Summary Statistics

In [41]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
 status,
 speed,
 precision,
 signal,
 satellites
FROM `{mobile_data_cleaned}`
LIMIT 10
"""

# Execute the query
query_df(query)

Unnamed: 0,status,speed,precision,signal,satellites
0,2,131.1,30.0,7,2.0
1,2,131.4,23.0,8,4.0
2,2,9.0,19.0,12,9.0
3,2,70.3,10.0,8,10.0
4,2,77.6,9.0,16,1.0
5,2,4.1,12.0,17,10.0
6,2,1.6,8.0,11,4.0
7,0,5.1,38.0,17,5.0
8,2,13.4,5.0,26,5.0
9,2,0.7,7.0,13,3.0


    - As 'status' values are directly related to the 'description' (categorical) values, we will drop the 'status' column.

In [42]:
# Datasets: {mobile_data_cleaned}

# SQL query: delete specific rows
query = f"""
ALTER TABLE `{mobile_data_cleaned}`
DROP COLUMN status
"""

# Execute the query 
run_query(query)

Query successfully executed, and the table has been updated.


Despite the option to load the entire dataset into a DataFrame and later apply the pandas .describe() function to obtain all the summary statistics for numerical features, we are going to manually calculate some of them using SQL commands.

In [43]:
# Datasets: {mobile_data_cleaned}

# SQL query:
query = f"""
SELECT
  'speed' AS metric,
  MIN(speed) AS min_value,
  CAST(MAX(speed) AS INT64) AS max_value, -- Cast as INT64
  ROUND(AVG(speed), 2) AS avg_value,
  CAST(ROUND(STDDEV_POP(speed), 2) AS INT64) AS std_value -- Cast as INT64
FROM `{mobile_data_cleaned}`

UNION ALL

SELECT
  'satellites' AS metric,
  MIN(satellites) AS min_value,
  CAST(MAX(satellites) AS INT64) AS max_value, -- Cast as INT64
  ROUND(AVG(satellites), 2) AS avg_value,
  CAST(ROUND(STDDEV_POP(satellites), 2) AS INT64) AS std_value -- Cast as INT64
FROM `{mobile_data_cleaned}`

UNION ALL

SELECT
  'precision' AS metric,
  MIN(precision) AS min_value,
  CAST(MAX(precision) AS INT64) AS max_value, -- Cast as INT64
  ROUND(AVG(precision), 2) AS avg_value,
  CAST(ROUND(STDDEV_POP(precision), 2) AS INT64) AS std_value -- Cast as INT64
FROM `{mobile_data_cleaned}`

UNION ALL

SELECT
  'signal' AS metric,
  MIN(signal) AS min_value,
  CAST(MAX(signal) AS INT64) AS max_value, -- Cast as INT64
  ROUND(AVG(signal), 2) AS avg_value,
  CAST(ROUND(STDDEV_POP(signal), 2) AS INT64) AS std_value -- Cast as INT64
FROM `{mobile_data_cleaned}`
"""

# Execute the query
query_df(query)

Unnamed: 0,metric,min_value,max_value,avg_value,std_value
0,speed,0.0,255,25.85,35
1,satellites,0.0,11503299477926,1068783.46,3506348267
2,signal,0.0,99,13.2,7
3,precision,0.0,201503299477926,28152604.87,68773279817


    - Handling 'precision' and 'satellites' possible outliers 

In [44]:
# Datasets: {mobile_data_cleaned}

# SQL query: precision values ordered DESC to check maximum values
query = f"""
SELECT
 CAST(precision AS INT64) precision
FROM `{mobile_data_cleaned}`
ORDER BY 1 DESC
LIMIT 10
"""

# Execute the query
query_df(query)

Unnamed: 0,precision
0,201503299477926
1,101503867043000
2,5304
3,3799
4,3400
5,3400
6,3400
7,3400
8,3400
9,3400


In [45]:
# Datasets: {mobile_data_cleaned}

# SQL query: satellites values ordered DESC to check maximum values
query = f"""
SELECT
 CAST(satellites AS INT64) satellites
FROM `{mobile_data_cleaned}`
ORDER BY 1 DESC
LIMIT 10
"""

# Execute the query
query_df(query)

Unnamed: 0,satellites
0,11503299477926
1,42
2,29
3,29
4,29
5,29
6,29
7,29
8,28
9,28


    - Handling outliers in 'precision' colum

Now we are going to calculate various percentiles (25th, 50th, 75th, 90th, 95th, 99th) for the 'precission' column in the table and display their values.

In [46]:
# Datasets: {mobile_data_cleaned}

# SQL query: satellites values ordered DESC to check maximum values
query = f"""
WITH Percentiles AS (
  SELECT
    APPROX_QUANTILES(precision, 100)[OFFSET(25)] AS percentile_25,
    APPROX_QUANTILES(precision, 100)[OFFSET(50)] AS median,
    APPROX_QUANTILES(precision, 100)[OFFSET(75)] AS percentile_75,
    APPROX_QUANTILES(precision, 100)[OFFSET(90)] AS percentile_90,
    APPROX_QUANTILES(precision, 100)[OFFSET(95)] AS percentile_95,
    APPROX_QUANTILES(precision, 100)[OFFSET(99)] AS percentile_99
  FROM `{mobile_data_cleaned}`
)
SELECT percentile_25, median, percentile_75, percentile_90, percentile_95, percentile_99
FROM Percentiles;
"""

# Execute the query
query_df(query)

Unnamed: 0,percentile_25,median,percentile_75,percentile_90,percentile_95,percentile_99
0,10.0,17.0,30.0,48.0,62.0,118.0


In [47]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
 COUNT(*) precision_g_t_121_count
FROM `{mobile_data_cleaned}`
WHERE precision > 121
"""

# Execute the query
query_df(query)

Unnamed: 0,precision_g_t_121_count
0,106577


    - Percentile 99: the top 1% of values (106577 rows) have precision values greater than 121.

    - How would the removal of these outliers affect the data distribution, and would this impact be consistent across different provinces?

This following query calculates and compares outlier percentages in different provinces based on the 'precision' column, where values above 121 are considered outliers. It provides insights into how outliers are distributed among provinces.

In [48]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT
  province,
  COUNT(*) AS province_records,
  SUM(CASE WHEN precision > 121 THEN 1 ELSE 0 END) AS province_outliers,
  ROUND((SUM(CASE WHEN precision > 121 THEN 1 ELSE 0 END) / COUNT(*)) * 100, 2) AS percentage_outliers
FROM `{mobile_data_cleaned}`
GROUP BY 1
ORDER BY 1
"""

# Execute the query
query_df(query)

Unnamed: 0,province,province_records,province_outliers,percentage_outliers
0,Barcelona,7367393,77074,1.05
1,Girona,1391076,14618,1.05
2,Lleida,1003705,4417,0.44
3,Tarragona,1002080,10468,1.04


    - Barcelona, Girona and Tarragona would have a similar impact (around 1.05% of their data) if these outliers are removed. Only Lleida would have a smaller impact (0.44%).
    
    - We are going to proceed with the deletion of 'precision' values higher than 121.

In [49]:
# Datasets: {mobile_data_cleaned}

# SQL query: delete outliers in 'precision' column
query = f"""
DELETE FROM `{mobile_data_cleaned}`
WHERE precision > 121
"""

# Execute the query
run_query(query)

Query successfully executed, and the table has been updated.


    - As a result of removing the 'precision' outliers, the outlier in the 'satellites' column, which had a value of 11503299477926, has also been eliminated from the dataset.

In [50]:
# Datasets: {mobile_data_cleaned}

# SQL query: satellites values ordered DESC to check maximum values
query = f"""
SELECT
 CAST(satellites AS INT64) satellites
FROM `{mobile_data_cleaned}`
ORDER BY 1 DESC
LIMIT 10
"""

# Execute the query
query_df(query)

Unnamed: 0,satellites
0,42
1,29
2,29
3,29
4,29
5,29
6,29
7,28
8,28
9,28


### Save Cleaned Table

If required, the fully cleaned dataset could be stored under the name 'mobile_data_2015_2017_cleaned_final' for use in subsequent analysis and reporting.

In [51]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
CREATE OR REPLACE TABLE `bq-analyst-230590.project_cat_mobile_coverage_2015_2017.mobile_data_2015_2017_cleaned_final`
AS
SELECT *
FROM `{mobile_data_cleaned}`
"""

# Execute the query
# run_query(query)

### Cleaned Table

In [52]:
# Datasets: {mobile_data_cleaned}

# SQL query: 
query = f"""
SELECT *
FROM `{mobile_data_cleaned}`
#LIMIT 20
"""

# Execute the query
cleaned_data = query_df(query)

# Display data
display(cleaned_data.head(), cleaned_data.tail(), cleaned_data.shape, cleaned_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10657677 entries, 0 to 10657676
Data columns (total 18 columns):
 #   Column         Dtype  
---  ------         -----  
 0   date           object 
 1   hour           object 
 2   lat            float64
 3   long           float64
 4   signal         int64  
 5   net_provider   object 
 6   operator       object 
 7   description    object 
 8   net            object 
 9   speed          float64
 10  satellites     float64
 11  precision      float64
 12  provider       object 
 13  activity       object 
 14  postal_code    object 
 15  town_name      object 
 16  position_geom  object 
 17  province       object 
dtypes: float64(5), int64(1), object(12)
memory usage: 1.4+ GB


Unnamed: 0,date,hour,lat,long,signal,net_provider,operator,description,net,speed,satellites,precision,provider,activity,postal_code,town_name,position_geom,province
0,2015-10-25,10:14:37,41.74896,2.11788,3,Vodafone,Vodafone,STATE_IN_SERVICE,4G,0.0,6.0,22.0,GPS,UNKNOWN,80641,Castellterçol,POINT(2.11788 41.74896),Barcelona
1,2015-05-18,17:06:35,41.62369,2.29211,21,Vodafone,Vodafone,STATE_EMERGENCY_ONLY,3G,45.7,4.0,63.0,GPS,IN_VEHICLE,80863,les Franqueses del Vallès,POINT(2.29211 41.62369),Barcelona
2,2016-07-11,13:33:55,41.44238,1.86417,6,Vodafone,Vodafone,STATE_EMERGENCY_ONLY,3G,5.0,4.0,10.0,GPS,IN_VEHICLE,80919,Gelida,POINT(1.86417 41.44238),Barcelona
3,2015-02-11,10:17:18,41.35198,2.12417,18,Vodafone,Vodafone,STATE_IN_SERVICE,3G,0.0,0.0,51.0,Fused,STILL,81017,l'Hospitalet de Llobregat,POINT(2.12417 41.35198),Barcelona
4,2016-01-12,09:19:47,41.35391,2.12155,3,Vodafone,Vodafone,STATE_EMERGENCY_ONLY,4G,6.9,12.0,18.0,GPS,IN_VEHICLE,81017,l'Hospitalet de Llobregat,POINT(2.12155 41.35391),Barcelona


Unnamed: 0,date,hour,lat,long,signal,net_provider,operator,description,net,speed,satellites,precision,provider,activity,postal_code,town_name,position_geom,province
10657672,2015-06-09,14:03:44,41.34395,2.10966,20,Yoigo,Yoigo,STATE_IN_SERVICE,2G,23.0,4.0,36.0,GPS,IN_VEHICLE,81017,l'Hospitalet de Llobregat,POINT(2.10966 41.34395),Barcelona
10657673,2015-08-31,08:37:48,41.49876,2.19254,10,Yoigo,Yoigo,STATE_EMERGENCY_ONLY,2G,94.4,6.0,8.0,GPS,IN_VEHICLE,81252,Montcada i Reixac,POINT(2.19254 41.49876),Barcelona
10657674,2016-02-11,07:36:02,41.55401,2.11517,7,Yoigo,Yoigo,STATE_EMERGENCY_ONLY,2G,45.1,5.0,32.0,GPS,IN_VEHICLE,81878,Sabadell,POINT(2.11517 41.55401),Barcelona
10657675,2015-04-18,22:51:12,41.34886,1.69144,19,Yoigo,Yoigo,STATE_EMERGENCY_ONLY,3G,5.7,0.0,35.0,Fused,ON_FOOT,83054,Vilafranca del Penedès,POINT(1.69144 41.34886),Barcelona
10657676,2017-04-28,18:56:10,41.93294,1.19846,8,Yoigo,Yoigo,STATE_EMERGENCY_ONLY,2G,45.6,2.0,10.0,GPS,IN_VEHICLE,250426,la Baronia de Rialb,POINT(1.19846 41.93294),Lleida


(10657677, 18)

None