## Align Ookla data at the census tract level for Florida

👋 Hi, I’m @Eric-G-Romano
👀 I’m interested in all things Machine Learning
🌱 I’m currently working on recommendation systems and honing my Natural Language Processing skills
💞️ I’m looking to collaborate on projects regarding Machine Learning, Data Science and Data Analytics
📫 How to reach me: eric.romano@eriseconsulting.com
📰 Check out my GitHub: https://github.com/Eric-G-Romano
📰 Here is a link to my blog: https://eg-romano.medium.com/

### Task 1: Re-aggregate Florida Ookla data across quarters to get a yearly median for 2021
This was already completed. Ookla decided to use means instead of medians.
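
Because the Ookla tiles carry means rather than medians, one way to combine quarters into a yearly value is a tests-weighted mean per quadkey. The sketch below is an assumption about how such a step could look, not a record of how Task 1 was actually done; the `quarterly` frame, its values, and the `avg_d_kbps_2021` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly rows for two quadkeys (values are illustrative only).
quarterly = pd.DataFrame({
    "quadkey": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "avg_d_kbps": [100_000, 200_000, 50_000, 70_000],
    "tests": [1, 3, 2, 2],
})

# Tests-weighted mean per quadkey: sum(value * tests) / sum(tests).
weighted = quarterly.assign(d_x_tests=quarterly["avg_d_kbps"] * quarterly["tests"])
sums = weighted.groupby("quadkey")[["d_x_tests", "tests"]].sum()
yearly = (sums["d_x_tests"] / sums["tests"]).rename("avg_d_kbps_2021").reset_index()
print(yearly)
```

Weighting by `tests` keeps a quarter with very few measurements from counting as much as a quarter with many.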

### Task 2: Re-aggregate Ookla data to the census tract level. Consider statistical methods to re-aggregate the medians across time and up to the tract level, and to handle partial or missing data at the tract level

For example: how to average the medians within a single census tract, how to handle an Ookla quadtile that spans multiple census tracts, how to handle a quadtile metric that lies only partially in one census tract, and how best to find a distribution of medians across census tracts.
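
One candidate statistic for summarising tile values inside a tract is a tests-weighted median, which is less sensitive to a few very fast or very slow tiles than a weighted mean. The helper below is a minimal sketch; `weighted_median` is a hypothetical name, it returns the lower weighted median, and the sample values are invented.

```python
import numpy as np

def weighted_median(values, weights):
    """Return the smallest value at which the cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)                 # sort values, carrying the weights along
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)                   # running total of weights
    idx = np.searchsorted(cum, 0.5 * cum[-1])  # first position where cum >= half the total
    return values[idx]

# Illustrative tile speeds (kbps) in one tract, weighted by number of tests.
speeds = [50_000, 100_000, 400_000]
tests = [5, 10, 1]
print(weighted_median(speeds, tests))
print(np.average(speeds, weights=tests))
```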

In [None]:
from datetime import datetime

import geopandas as gp
import matplotlib 
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np 

from shapely.geometry import Point 
from adjustText import adjust_text

In [168]:
from functools import partial
from pyproj import Proj, transform

In [169]:
florida_ookla = pd.read_csv('ookla_combined_fl.csv')

In [170]:
florida_ookla_drop = florida_ookla.drop(['STATEFP','COUNTYFP','TRACTCE','AFFGEOID','NAME','LSAD','ALAND','AWATER'], axis=1)
florida_ookla_drop.head()

Unnamed: 0,quadkey,avg_d_kbps,avg_u_kbps,avg_lat_ms,tests,devices,type,quarter,year,GEOID,tile
0,320201332021332,90726,28194,35,3,2,fixed,Q1,2021,12037970302,"list(c(-85.001220703125, -84.9957275390625, -8..."
1,320231100220123,159542,31827,14,60,10,fixed,Q1,2021,12099000905,"list(c(-80.1287841796875, -80.123291015625, -8..."
2,320231102200112,211000,52752,14,37,18,fixed,Q1,2021,12099005916,"list(c(-80.123291015625, -80.1177978515625, -8..."
3,320231223033003,115365,20575,15,7,5,fixed,Q1,2021,12087971200,"list(c(-81.0736083984375, -81.068115234375, -8..."
4,320212301232330,217592,48549,9,21,8,fixed,Q1,2021,12057011906,"list(c(-82.496337890625, -82.4908447265625, -8..."


In [171]:
count= pd.DataFrame(florida_ookla_drop['GEOID'].value_counts())
count.reset_index(inplace=True)
count.rename(columns={"GEOID":"count", "index":"GEOID"}, inplace=True)
count.head()

Unnamed: 0,GEOID,count
0,12047960200,1336
1,12079110100,1105
2,12089050301,987
3,12095017103,975
4,12021011202,784


In [172]:
# After consulting domain experts, our approach to weighting the averages
# is to use the number of tests recorded for each aggregated quadkey.

In [173]:
# Weighted average of a column, weighted by the number of tests in each quadkey.
# The lambda looks up the weights through the index labels of each group.
weight_avg = lambda x: np.average(x, weights=florida_ookla_drop.loc[x.index, 'tests'])
aggargated = florida_ookla_drop.groupby('GEOID').agg(d_kbps_wm = ('avg_d_kbps', weight_avg),
                                                     u_kbps_wm = ('avg_u_kbps', weight_avg))
aggargated.reset_index(inplace=True)
# Convert kbps to Mbps (divide by 1000) so the values are in the same range as the FCC data.
# Note: the column names say "bps", but the values are in Mbps.
aggargated['d_bps_wm'] = aggargated['d_kbps_wm'].div(1000)
aggargated['u_bps_wm'] = aggargated['u_kbps_wm'].div(1000)

# Preview the first rows
aggargated.head()

Unnamed: 0,GEOID,d_kbps_wm,u_kbps_wm,d_bps_wm,u_bps_wm
0,12001000200,105032.754221,31521.242026,105.032754,31.521242
1,12001000301,101549.70438,10910.116788,101.549704,10.910117
2,12001000302,126246.534884,15840.403101,126.246535,15.840403
3,12001000400,121749.295652,47429.365217,121.749296,47.429365
4,12001000500,110116.405263,33817.733333,110.116405,33.817733


In [174]:
len(aggargated)

4116

In [None]:
## The GEOIDs in the Ookla data do not completely match those in the FCC data. Possible reasons:
# 1) The method used to assign each quadkey tile to a GEOID might not account for quadkeys that
#    cover multiple GEOIDs.
# 2) Ookla might not have any tests in the missing areas.
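
To see which tracts are missing on each side (the aggregated Ookla table has 4116 rows, the FCC table 4205), an outer merge with `indicator=True` can label every tract by where it appears. This is a sketch on toy frames with invented IDs and values; only the column names `GEOID` and `tract` come from the data above.

```python
import pandas as pd

# Toy stand-ins for the aggregated Ookla table and the FCC table (IDs are invented).
ookla = pd.DataFrame({"GEOID": [101, 102], "d_kbps_wm": [105_000.0, 101_500.0]})
fcc = pd.DataFrame({"tract": [101, 102, 103], "max_dn": [980.0, 990.0, 989.7]})

# The _merge column reports 'both', 'left_only' (FCC only) or 'right_only' (Ookla only).
merged = fcc.merge(ookla, how="outer", left_on="tract", right_on="GEOID", indicator=True)
fcc_only = merged.loc[merged["_merge"] == "left_only", "tract"]
ookla_only = merged.loc[merged["_merge"] == "right_only", "GEOID"]

print(merged["_merge"].value_counts())
print(fcc_only.tolist())
```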

In [175]:
# Some quadkeys straddle the boundaries between census tracts.
# I would like to explore what percentage of each quadkey's area falls within a given census tract.
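
A minimal sketch of that idea, using plain Shapely geometries: intersect a tile with each tract and divide by the tile's area. The rectangles below are hypothetical and assumed to be in a projected CRS with units in metres; real tiles and tract polygons would first need to be reprojected to a suitable equal-area CRS. Splitting `tests` in proportion to area is one possible allocation rule, not something the data prescribes.

```python
from shapely.geometry import box

# Hypothetical geometries in a projected CRS (units: metres).
tile = box(0, 0, 600, 600)
tracts = {
    "tract_a": box(-1000, -1000, 400, 2000),
    "tract_b": box(400, -1000, 3000, 2000),
}

# Share of the tile's area that falls inside each tract.
shares = {name: tile.intersection(geom).area / tile.area for name, geom in tracts.items()}

# Assumed rule: split the tile's test count across tracts in proportion to area.
tile_tests = 30
allocated_tests = {name: tile_tests * share for name, share in shares.items()}

print(shares)
print(allocated_tests)
```

The allocated test counts could then serve as weights in the tract-level weighted average in place of the raw `tests` column.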

In [176]:
florida_fcc = pd.read_csv('fcc_477_census_tract_FL.csv')
florida_fcc.head()

Unnamed: 0,tract,max_dn,max_up,dn10,dn100,dn250,fiber_100u,state
0,12001000200,979.959184,53.995265,1.92517,1.0,1.0,0.020408,12
1,12001000301,990.015,34.65384,1.56,1.06,1.0,0.0,12
2,12001000302,989.663158,44.797558,1.178947,1.010526,1.0,0.010526,12
3,12001000400,947.382979,33.170979,1.329787,0.946809,0.946809,0.0,12
4,12001000500,967.066914,392.588372,2.130112,1.405204,1.364312,0.371747,12


In [177]:
len(florida_fcc)

4205

In [178]:
def return_agg(dataframe_per_state_Ookla):
    """Aggregate one state's Ookla quadkey rows to census tracts (GEOID) with tests-weighted means."""
    # Weighted average of a column, weighted by the number of tests in each quadkey
    weight_avg = lambda x: np.average(x, weights=dataframe_per_state_Ookla.loc[x.index, 'tests'])
    aggregated = dataframe_per_state_Ookla.groupby('GEOID').agg(d_kbps_wm = ('avg_d_kbps', weight_avg),
                                                                u_kbps_wm = ('avg_u_kbps', weight_avg))
    aggregated.reset_index(inplace=True)
    # Convert kbps to Mbps (divide by 1000) to match the range of the FCC data;
    # the column names say "bps", but the values are in Mbps
    aggregated['d_bps_wm'] = aggregated['d_kbps_wm'].div(1000)
    aggregated['u_bps_wm'] = aggregated['u_kbps_wm'].div(1000)

    return aggregated

In [179]:
return_agg(florida_ookla_drop)

Unnamed: 0,GEOID,d_kbps_wm,u_kbps_wm,d_bps_wm,u_bps_wm
0,12001000200,105032.754221,31521.242026,105.032754,31.521242
1,12001000301,101549.704380,10910.116788,101.549704,10.910117
2,12001000302,126246.534884,15840.403101,126.246535,15.840403
3,12001000400,121749.295652,47429.365217,121.749296,47.429365
4,12001000500,110116.405263,33817.733333,110.116405,33.817733
...,...,...,...,...,...
4111,12133970104,52912.755319,9752.546099,52.912755,9.752546
4112,12133970200,19064.187845,3735.530387,19.064188,3.735530
4113,12133970301,24254.752000,5218.992000,24.254752,5.218992
4114,12133970302,99470.279793,10743.829016,99.470280,10.743829
