# Machine Learning Playground
Started 3/4/2022

This is going to be where I test out new data visualizations, machine learning models, and methods of cleaning and preprocessing data.

This is the first time I've committed to using Jupyter Notebooks, so it'll be a learning experience, but I intend to document as much as I can :).

In [15]:
import json
import pandas as pd
import urllib.request as rq

## Data Prep
Below are the functions, global variables, and function calls needed to download the data from Our World in Data (OWID) and put it in a usable format.

In [16]:

COVID_LATEST_DATA_URL = "https://covid.ourworldindata.org/data/latest/owid-covid-latest.json"
COVID_WHOLE_DATA_URL = "https://covid.ourworldindata.org/data/owid-covid-data.json"


def get_covid_df(url: str) -> pd.DataFrame:
    """Download the OWID JSON at `url` and load it into a DataFrame"""
    # Use a separate name for the response so the `url` argument is not shadowed
    with rq.urlopen(url) as response:
        covid_data = json.loads(response.read().decode())
    covid_df = pd.DataFrame(covid_data)
    return covid_df

def filter_data(covid_df: pd.DataFrame, country: str) -> pd.Series:
    """Return a copy of the column for the given country code"""
    filtered_df = covid_df[country].copy()
    return filtered_df

def extract_full_country_data(data: pd.Series) -> pd.DataFrame:
    """Creates a dataframe from the list of daily records nested under the `data` key"""
    # Index by label; attribute access (data.data) can collide with pandas attributes
    df = pd.DataFrame.from_dict(data["data"])
    return df

In [17]:
covid_df = get_covid_df(COVID_WHOLE_DATA_URL)
filtered_df = filter_data(covid_df, "USA")
full_df = extract_full_country_data(filtered_df)
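
To make the shape of the data easier to see, here is a small offline sketch of the same three steps. The `sample` dictionary is made up for illustration: it only mimics the layout I'm assuming the OWID JSON has (country code, then metadata fields, then a `data` list of daily records), and the numbers are not real.

```python
import pandas as pd

# Hypothetical miniature of the OWID layout; the values are invented.
sample = {
    "USA": {
        "location": "United States",
        "data": [
            {"date": "2022-03-01", "new_cases": 10.0},
            {"date": "2022-03-02", "new_cases": 12.0},
        ],
    },
    "CAN": {
        "location": "Canada",
        "data": [{"date": "2022-03-01", "new_cases": 3.0}],
    },
}

# Each country code becomes a column; each top-level key becomes a row label.
sample_df = pd.DataFrame(sample)

# Selecting one column gives a Series indexed by "location" and "data".
usa = sample_df["USA"]

# The "data" entry is a list of dicts, one per day, which pandas turns into rows.
usa_daily = pd.DataFrame(usa["data"])
print(usa_daily)
```

This is why the real pipeline needs two extraction steps: one to pick the country, and one to unpack the nested daily records.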

## Getting the Label
Below are the functions and function calls necessary to extract the label from the dataset.

One clarification: `clean_NaN()` is a misleading name. All the function does is fill the NaN values with 0s, which is an easy area for improvement in the future.

In [18]:
def extract_label_column(data:pd.DataFrame, column:str) -> pd.Series:
    """Create a copy of the column and return it as a Series"""
    series = data[column].copy().rename('')
    return series

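def clean_NaN(data: "pd.DataFrame | pd.Series") -> "pd.DataFrame | pd.Series":
    """Return a copy with NaN values filled with 0 (used on both the label Series and the feature DataFrame)"""
    df = data.copy().fillna(0)
    return df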

In [19]:
base_y = extract_label_column(full_df, "new_cases")
cleaned_y = clean_NaN(base_y)
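
Filling with zeros treats a missing report as "no cases", which may not be what the gap means. As a rough sketch of what an improved `clean_NaN()` could try instead, the snippet below compares zero filling with linear interpolation and forward filling on a made-up series. Whether either option suits the OWID data is an assumption I haven't tested.

```python
import numpy as np
import pandas as pd

# Made-up daily counts with two missing reports.
toy = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

zero_filled = toy.fillna(0)        # what clean_NaN() does today
interpolated = toy.interpolate()   # linear fill between known neighbours
forward_filled = toy.ffill()       # carry the last known value forward

# Put the three strategies side by side for comparison.
comparison = pd.DataFrame({
    "zero": zero_filled,
    "interpolated": interpolated,
    "ffill": forward_filled,
})
print(comparison)
```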

## Getting the Features

Below are the functions and function calls necessary for extracting the features from the full dataset.

This section differs from the label in a few ways. Note that there's a global variable called `DROP_COLUMNS` in which I have listed some features I believe may cause overfitting due to their relationship to the label.

Also, this is where the preprocessing begins: we narrow down which features we'll be using via `SelectKBest` and then scale them with sklearn's `StandardScaler`.
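
The `DROP_COLUMNS` list is written by hand, but the same idea can be sketched programmatically. The snippet below is only an illustration on a made-up frame that borrows a few OWID-style column names. The rule it uses (drop anything whose name starts with the label name, plus anything non-numeric) is my assumption about what the hand-written list is getting at, not a tested replacement for it.

```python
import pandas as pd

LABEL = "new_cases"

# Made-up frame that only borrows a few OWID-style column names.
toy = pd.DataFrame({
    "date": ["2022-03-01", "2022-03-02"],
    "new_cases": [10.0, 12.0],
    "new_cases_smoothed": [9.5, 11.0],
    "new_cases_per_million": [0.03, 0.04],
    "new_deaths": [1.0, 0.0],
})

# Any column whose name starts with the label is a restatement of the label.
derived = [c for c in toy.columns if c.startswith(LABEL)]

# Non-numeric columns cannot be scored by f_regression or scaled.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

features = toy.drop(columns=derived + non_numeric)
print(list(features.columns))
```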

In [20]:
DROP_COLUMNS = [
    'date',
    'new_cases',
    'new_cases_per_million',
    'new_cases_smoothed',
    'new_cases_smoothed_per_million',
    'tests_units'
]

def remove_overfit_columns(data:pd.DataFrame, columns:list) -> pd.DataFrame:
    """Removes columns that might cause overfitting, such as the per million and smoothed versions of our label"""
    data = data.copy().drop(columns = columns)
    return data

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

def optimize_feature_columns(X:pd.DataFrame, num_features:int, y:pd.Series) -> pd.DataFrame:
    """Select the `num_features` columns that score highest against `y` (via f_regression) and return them as a dataframe"""
    feature_selection = SelectKBest(score_func=f_regression, k=num_features)
    X_selected = feature_selection.fit_transform(X.copy(), y)
    X_selected = pd.DataFrame(X_selected)
    return X_selected

def scale_data(X:pd.DataFrame) -> pd.DataFrame:
    """Uses StandardScaler to scale the data"""
    X = StandardScaler().fit_transform(X)
    X = pd.DataFrame(X)
    return X

In [None]:
base_X = remove_overfit_columns(full_df, DROP_COLUMNS)
cleaned_X = clean_NaN(base_X)
optimal_X = optimize_feature_columns(cleaned_X, 10, cleaned_y)
scaled_X = scale_data(optimal_X)
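
One side effect of wrapping the selector and scaler output in `pd.DataFrame` is that the original column names are replaced with integers. Here is a minimal sketch of how the names could be kept, using `SelectKBest.get_support()` on synthetic data. The helper `select_and_scale` and the column names `signal`, `noise_a`, and `noise_b` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler


def select_and_scale(X: pd.DataFrame, y: pd.Series, k: int) -> pd.DataFrame:
    """Hypothetical helper: pick the k best columns, scale them, keep names."""
    selector = SelectKBest(score_func=f_regression, k=k)
    selected = selector.fit_transform(X, y)
    # get_support() returns a boolean mask over the input columns.
    kept = X.columns[selector.get_support()]
    scaled = StandardScaler().fit_transform(selected)
    return pd.DataFrame(scaled, columns=kept, index=X.index)


# Synthetic data: the label depends only on "signal".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=200)

result = select_and_scale(X_demo, y_demo, k=1)
print(list(result.columns))
```

Keeping the names makes it possible to see which features `SelectKBest` actually chose, which the integer-labelled `scaled_X` above can't show.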