# Lecture 2 Data Handling
__MATH 3480__ - Dr. Michael Olson

Outline:
* Obtaining Data
* Loading Data
* Cleaning Data
   * Missing Labels
   * Missing Values
* Data Wrangling
   * Joins and Merges

Reading
* Geron, Chapter 2 (pp. 42-51, 62-72)
* McKinney, Chapter 7 (pp. 203-209), Chapter 8 (pp. 253-268)

## Obtaining Data

In Math 3080, we saw the following ways to obtain data:
* Online websites (Kaggle, Data Centers)
* Web scraping
* Application Programming Interfaces (APIs)
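As a rough illustration of the web scraping route, the sketch below parses an HTML table into a DataFrame. It assumes the page has already been downloaded: the `HTML` string, its made-up contents, and the `TableParser` helper are all hypothetical stand-ins, and only the Python standard library and pandas are used.

```python
from html.parser import HTMLParser

import pandas as pd

# Hypothetical HTML standing in for a downloaded page (made-up values)
HTML = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableParser()
parser.feed(HTML)

# The first row holds the column labels; the rest are observations
header, *body = parser.rows
scraped = pd.DataFrame(body, columns=header)
scraped["value"] = scraped["value"].astype(int)
print(scraped)
```

In practice the HTML would come from a request to the website, and a dedicated parsing library is usually more convenient than a hand-written parser like this one.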

In [None]:
###   Online Websites   ###

import pandas as pd

# Crime Statistics between 2000 and 2020 - Used in Math 3080
# https://github.com/drolsonmi/math3080/blob/main/Datasets/Crime_Statistics_2000-2020.csv

crimes = pd.read_csv('https://raw.githubusercontent.com/drolsonmi/math3080/main/Datasets/Crime_Statistics_2000-2020.csv')

In [None]:
###   Web Scraping   ###


In [None]:
###   APIs   ###

import pandas as pd
import requests

# First, you need to register for an account with the website supplying the API.
# When registering, you will receive an authorization key

url = "https://api.yelp.com/v3/businesses/search"

headers = {
    "accept": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"   # Replace with your own key; never share a real key
}

params = {
    "term" : "restaurants",
    #"latitude" : 40.77,
    #"longitude" : -111.9,
    "location" : "Ephraim, UT",
    "radius" : 5000,                  # Search radius in meters
    "limit" : 50
}

response = requests.get(url, headers=headers, params=params)

## Loading Data

In [None]:
import numpy as np
import pandas as pd


## From Seaborn
import seaborn as sns
iris = sns.load_dataset('iris')



In [None]:
import pandas as pd

## From SciKit-Learn
from sklearn.datasets import load_iris
iris2 = load_iris()

iris_df = pd.DataFrame(iris2['data'], columns=iris2['feature_names'])
iris_df

In [None]:


## From a File
X = np.loadtxt('Data/X.txt')                      # Numpy array
ins = pd.read_csv('Data/insurance.csv', sep=",")  # Pandas DataFrame


## From a website
crimes = pd.read_csv('https://raw.githubusercontent.com/drolsonmi/math3080/main/Datasets/Crime_Statistics_2000-2020.csv')

In [None]:
###  Example from Geron textbook that automates the entire process  ###
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

# This function creates a datasets/housing subdirectory on your computer,
# downloads the compressed file into that directory, and extracts it
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

# This function opens up the dataset
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
fetch_housing_data()            # Download the file
housing = load_housing_data()   # Load the file

## Cleaning Data

Problems that can come up with data:
* Unlabelled data
* Missing data
* Unorganized data

All data needs to be labelled. If the labels aren't in the data file itself, there should be another file that explains the data. Labels are often described in
* a separate README file
* the top of the data file
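
When the labels only appear as a description at the top of the file, they can be supplied by hand while loading. The sketch below assumes a made-up file whose first lines are `#` comments describing three columns; the contents and the column names `age`, `bmi`, and `charges` are illustrative, not taken from a real dataset.

```python
import io

import pandas as pd

# Hypothetical file contents: labels are described in comment lines at the top
raw = io.StringIO(
    "# Column 1: age in years\n"
    "# Column 2: body mass index\n"
    "# Column 3: annual charges in dollars\n"
    "19,27.9,16884.92\n"
    "33,22.7,21984.47\n"
)

# comment="#" skips the description lines, header=None says the file has no
# label row, and names= attaches the labels we read from the description
labelled = pd.read_csv(raw, comment="#", header=None,
                       names=["age", "bmi", "charges"])
print(labelled)
```

The same `names=` argument works when the labels come from a separate README file instead.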



Ways that missing data could be indicated:
* An extreme number (9999, -9999)
* NaN (Not a Number)
* Blank entry (no information) - programs usually fill these with NaN
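
A small sketch of how these indicators can be turned into NaN with pandas. The file contents are made up, and treating 9999 and -9999 as sentinels is an assumption that should always be checked against the data description.

```python
import io

import numpy as np
import pandas as pd

# Made-up file contents: -9999 and 9999 are sentinels, station C is blank
text = "station,temperature\nA,21.5\nB,-9999\nC,\nD,9999\n"

# Option 1: declare the sentinel values while reading the file
readings = pd.read_csv(io.StringIO(text), na_values=[9999, -9999])

# Option 2: load first, then replace the sentinels with NaN
loaded = pd.read_csv(io.StringIO(text))
cleaned = loaded.replace([9999, -9999], np.nan)

# Both approaches mark the same entries as missing
print(readings["temperature"].isnull().sum())
print(cleaned["temperature"].isnull().sum())
```

The blank entry is converted to NaN automatically in both cases; only the extreme numbers need to be handled explicitly.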

Let's use the dataset of Titanic passengers. We'll load it below. Data descriptions can be found at [https://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf](https://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf)

In [None]:
from sklearn.datasets import fetch_openml
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True, parser='auto')

df = X.copy()
df['survived'] = y
df.head()

In [None]:
X.head()

In [None]:
X.shape

Determine if there are any missing values

In [None]:
df.isnull().sum()

Solutions to missing data:
* Drop the observations
   * If there are a large number of observations and only a few have missing values
* Drop the variable
   * If an extremely large percentage of the values for a given variable are missing
* Fill in ...
   * ... with variable mean
   * ... with variable maximum
   * ... with variable minimum
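
The options above can be compared side by side on a small example. This is a sketch on a made-up DataFrame named `toy`; the 50% cutoff for dropping a variable is an arbitrary assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in every column
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 26.55, 8.05],
    "body": [np.nan, np.nan, np.nan, 135.0, np.nan],
})

# Fraction of missing values in each variable
missing_fraction = toy.isnull().mean()

# Drop the variable: remove columns missing more than an assumed 50% cutoff
mostly_missing = missing_fraction[missing_fraction > 0.5].index
reduced = toy.drop(columns=mostly_missing)

# Drop the observations: remove rows where 'age' is missing
dropped_rows = reduced.dropna(subset=["age"])

# Fill in with the variable mean, maximum, or minimum
filled_mean = reduced.fillna(reduced.mean())
filled_max = reduced.fillna(reduced.max())
filled_min = reduced.fillna(reduced.min())

print(missing_fraction)
print(filled_mean)
```

Here only `body` exceeds the cutoff, so it is the only column removed before the remaining gaps are dropped or filled.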

In [None]:
# Drop variables
df.drop('body', axis=1, inplace=True)
        
# Drop observations (rows) where 'age' is missing
df.dropna(subset=['age'], inplace=True)

In [None]:
# Fill in with variable mean
## Simple Method
df['fare'] = df['fare'].fillna(df['fare'].mean())
  # To do this for all numerical columns at once:
  # df.fillna(df.mean(numeric_only=True), inplace=True)

df.isnull().sum()

In [None]:
# Fill in with variable mean
## Machine Learning Method
  # Can be applied to multiple columns at the same time

import numpy as np
# Import the SimpleImputer class from the impute module of the sklearn library
from sklearn.impute import SimpleImputer

# Create an object using the SimpleImputer class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit on a 2-D selection (double brackets keep a DataFrame), then fill in the column
X['fare'] = imputer.fit_transform(X[['fare']]).ravel()

X.isnull().sum()

## Data Wrangling