## Classification - Acquisition Exercises

### Use a Python module containing datasets (pydata or seaborn datasets) as the source for the iris data. Create a pandas dataframe, df_iris, from this data.

- print the first 3 rows
- print the number of rows and columns (shape)
- print the column names
- print the data type of each column
- print the summary statistics for each of the numeric variables. Would you recommend rescaling the data based on these statistics?

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import sklearn.impute
import sklearn.preprocessing
import sklearn.model_selection

import env
import acquire
import prepare

In [None]:
# Load the iris data from seaborn
# print the first 3 rows

df_iris = sns.load_dataset('iris')
df_iris.head(3)

In [None]:
# print the number of rows and columns (shape)

df_iris.shape

In [None]:
# print the column names

df_iris.columns.tolist()

In [None]:
# another way

list(df_iris.columns.values)

In [None]:
# print the data type of each column

df_iris.info()

In [None]:
# print the summary statistics for each of the numeric variables.
# Would you recommend rescaling the data based on these statistics?

df_iris.describe()

In [None]:
# I wouldn't recommend rescaling the data: all four measurements are in the same unit (cm) and cover similar ranges.

### Read the Table1_CustDetails table from the Excel module dataset, Excel_Exercises.xlsx, into a dataframe, df_excel

- assign the first 100 rows to a new dataframe, df_excel_sample
- print the number of rows of your original dataframe
- print the first 5 column names
- print the column names that have a data type of object
- compute the range for each of the numeric variables.

In [None]:
custdetails = pd.read_excel('Spreadsheets_Exercises_Solutions.xlsx', sheet_name ='Table1_CustDetails')

In [None]:
custdetails.head()

In [None]:
# assign the first 100 rows to a new dataframe, df_excel_sample

df_excel_sample = custdetails.head(100)

In [None]:
df_excel_sample

In [None]:
# print the number of rows of your original dataframe

len(custdetails.index)

print(f'My original DataFrame has {custdetails.shape[0]} rows.')

In [None]:
# or

custdetails.shape

In [None]:
# print the first 5 column names

custdetails.columns[:5]

In [None]:
# print the column names that have a data type of object

custdetails.info()

In [None]:
custdetails.select_dtypes(include='object').columns.tolist()

In [None]:
# as a loop

for col in custdetails:
    if custdetails[col].dtype == 'O':
        print(col)

In [None]:
# Leave the loops in the 80s and do it with a list comprehension!

[col for col in custdetails if custdetails[col].dtype == 'O']

In [None]:
# compute the range for each of the numeric variables

monthly_charges = custdetails['monthly_charges']
monthly_charges_range = monthly_charges.max() - monthly_charges.min()

In [None]:
monthly_charges_range

In [None]:
total_charges = custdetails['total_charges']
total_charges_range = total_charges.max() - total_charges.min()

In [None]:
total_charges_range

In [None]:
# another way to do it

numeric_df = custdetails.select_dtypes(include='number')
print('The range of each numeric column is:')
print('-------------------------------------')
print(round(numeric_df.max() - numeric_df.min(), 2))

In [None]:
custdetails.describe().transpose()

### Read the data from this google sheet into a dataframe, df_google

- print the first 3 rows
- print the number of rows and columns
- print the column names
- print the data type of each column
- print the summary statistics for each of the numeric variables
- print the unique values for each of your categorical variables

In [None]:
# Read the data from this google sheet into a dataframe, df_google

sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'    

csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')

df_google = pd.read_csv(csv_export_url)

# print the first 3 rows

df_google.head(3)
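The URL rewrite above can be wrapped in a small helper. This is a minimal sketch: the function name `to_csv_export_url` and the `SHEET_ID` placeholder are made up for illustration, and it assumes the link has the `/edit#gid=` form used in this notebook. The sheet presumably also needs to be shared so that anyone with the link can view it, otherwise `pd.read_csv` will not get CSV back.

```python
def to_csv_export_url(sheet_url: str) -> str:
    """Turn a Google Sheets '/edit#gid=' link into its CSV export link."""
    return sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')


# Placeholder sheet id, for illustration only
example_url = 'https://docs.google.com/spreadsheets/d/SHEET_ID/edit#gid=0'
csv_url = to_csv_export_url(example_url)
print(csv_url)
```

The resulting string can be passed straight to `pd.read_csv`, as in the cell above.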

In [None]:
# print the number of rows and columns

df_google.shape

In [None]:
# print the column names

df_google.columns.tolist()

In [None]:
# print the data type of each column

df_google.info()

In [None]:
# print the summary statistics for each of the numeric variables

df_google.describe()

In [None]:
# print the unique values for each of your categorical variables

df_google['Name'].unique()

In [None]:
df_google['Sex'].unique()

In [None]:
df_google['Ticket'].unique()

In [None]:
df_google['Cabin'].unique()

In [None]:
# another way
print(df_google.Name.unique())
print('----------------------------')
print(df_google.Name.nunique())

In [None]:
df_google['Embarked'].unique()

In [None]:
def cat_uniques(df):
    for col in df:
        if df[col].dtype == 'O':
            print(col)
            print('-------------')
            print(df[col].value_counts(dropna=False))
            print('-------------')

In [None]:
cat_uniques(df_google)

### In a new python module, acquire.py:

#### get_titanic_data: returns the titanic data from the codeup data science database as a pandas data frame.

#### get_iris_data: returns the data from the iris_db on the codeup data science database as a pandas data frame. The returned data frame should include the actual name of the species in addition to the species_ids.
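The acquire.py module itself is not shown in this notebook. The following is only a sketch of what the two functions might look like. Everything beyond the function names and `iris_db` is an assumption: the `titanic_db` database name, the `passengers`, `measurements`, and `species` table names, the `mysql+pymysql` driver, and passing the credentials as arguments (the notebook's version takes no arguments and presumably reads them from env.py).

```python
import pandas as pd


def get_connection_url(db, user, host, password):
    """Build a SQLAlchemy-style MySQL connection URL for the given database."""
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'


def get_titanic_data(user, host, password):
    # Assumed: a 'titanic_db' database with a 'passengers' table
    url = get_connection_url('titanic_db', user, host, password)
    return pd.read_sql('SELECT * FROM passengers', url)


def get_iris_data(user, host, password):
    # Assumed: 'measurements' and 'species' tables that share a species_id key.
    # The join brings in the species name alongside the species_id.
    query = '''
        SELECT *
        FROM measurements
        JOIN species USING (species_id)
    '''
    url = get_connection_url('iris_db', user, host, password)
    return pd.read_sql(query, url)
```

Joining on `species_id` is what would satisfy the requirement that the returned data frame include the species name in addition to the id.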

In [None]:
titanic = acquire.get_titanic_data()

In [None]:
titanic.head()

In [None]:
iris_db = acquire.get_iris_data()

In [None]:
iris_db.head()

## Classification - Prep Exercises

### The end product of this exercise should be the specified functions in a python script named prepare.py. Do these in your classification_exercises.ipynb first, then transfer to the prepare.py file.

#### Iris Data

- Use the function defined in acquire.py to load the iris data.
- Drop the species_id and measurement_id columns.
- Rename the species_name column to just species.
- Encode the species name using a sklearn label encoder. Research the inverse_transform method of the label encoder. How might this be useful?
- Create a function named prep_iris that accepts the untransformed iris data, and returns the data with the transformations above applied.

In [None]:
df = acquire.get_iris_data()

In [None]:
df.head()

In [None]:
# Drop the species_id and measurement_id columns.

df = df.drop(columns=['species_id', 'measurement_id'])

In [None]:
df.head()

In [None]:
# Rename the species_name column to just species.

df.rename(columns={'species_name':'species'}, inplace=True)

In [None]:
df.head()

In [None]:
# Encode the species name using a sklearn label encoder.
# Research the inverse_transform method of the label encoder. How might this be useful?

# inverse_transform(self, y) Transform labels back to original encoding.

# Make the thing

label_encoder = sklearn.preprocessing.LabelEncoder()

# Fit the thing

label_encoder.fit(df['species'])  # LabelEncoder expects a 1-D array of labels

# Transform the thing

m = label_encoder.transform(df['species'])

In [None]:
label_encoder.classes_

In [None]:
m.shape

In [None]:
m = pd.DataFrame(m)

In [None]:
m

In [None]:
df.head()

In [None]:
df = pd.concat([df, pd.DataFrame(m)], axis=1)

In [None]:
df.rename(columns={0:'species_encoding'}, inplace=True)

In [None]:
df.sample(20)
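On the question of why `inverse_transform` is useful: a model trained on encoded labels predicts integer codes, and `inverse_transform` maps those codes back to readable names. A minimal sketch on a toy list of species follows; the "predictions" are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

species = ['setosa', 'versicolor', 'virginica', 'setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)  # classes are sorted, so setosa -> 0

# Pretend these integer codes came out of a model's predict()
predicted_codes = [2, 0, 1]
predicted_names = encoder.inverse_transform(predicted_codes)

print(list(codes))
print(list(predicted_names))
```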

In [None]:
# Create a function named prep_iris that accepts the untransformed iris data, and
# returns the data with the transformations above applied.

def encode_species(train, test):
    encoder = sklearn.preprocessing.OneHotEncoder()
    encoder.fit(train[['species']])
    
    cols = ['species_' + c for c in encoder.categories_[0]]
    
    m = encoder.transform(train[['species']]).todense()

    train = pd.concat([train,pd.DataFrame(m, columns=cols, index=train.index)], axis=1)
    
    m = encoder.transform(test[['species']]).todense()
    
    test = pd.concat([test,pd.DataFrame(m, columns=cols, index=test.index)], axis=1)
    
    return train, test
    

def prep_iris(df):
    df = df.drop(columns=['species_id', 'measurement_id'])
    df = df.rename(columns={'species_name':'species'})
    train, test = sklearn.model_selection.train_test_split(df, train_size=.8, random_state=19)
    train, test = encode_species(train, test)
    return train, test

In [None]:
df = acquire.get_iris_data()

In [None]:
df.head()

In [None]:
train, test = prep_iris(df)

In [None]:
train.head()

In [None]:
test.shape
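The `prep_iris` above splits the data and one-hot encodes the species. For comparison, here is a sketch of a variant that follows the exercise bullets literally with a label encoder. The function name `prep_iris_label_encoded` and the toy frame are hypothetical, and the column names are assumed to match what `acquire.get_iris_data()` returns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def prep_iris_label_encoded(df):
    """Drop id columns, rename species_name, and add an integer species code."""
    df = df.drop(columns=['species_id', 'measurement_id'])
    df = df.rename(columns={'species_name': 'species'})
    encoder = LabelEncoder()
    df['species_encoded'] = encoder.fit_transform(df['species'])
    # Return the encoder too, so the codes can be decoded later
    return df, encoder


# Toy stand-in for the acquired data
raw = pd.DataFrame({
    'species_id': [1, 2, 3],
    'measurement_id': [1, 51, 101],
    'sepal_length': [5.1, 7.0, 6.3],
    'species_name': ['setosa', 'versicolor', 'virginica'],
})
prepped, species_encoder = prep_iris_label_encoded(raw)
print(prepped)
```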

### Titanic Data

- Use the function you defined in acquire.py to load the titanic data set.
- Handle the missing values in the embark_town and embarked columns.
- Remove the deck column.
- Use a label encoder to transform the embarked column.
- Scale the age and fare columns using a min max scaler. Why might this be beneficial? When might you not want to do this?
- Create a function named prep_titanic that accepts the untransformed titanic data, and returns the data with the transformations above applied.
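As a rough illustration of the min max scaling step: the scaler maps each column to (x - min) / (max - min), with min and max learned from the training split only. The toy numbers below are invented. Scaling tends to help algorithms that depend on distances or gradient steps, since it keeps one wide-ranging column such as fare from dominating. It matters less for tree-based models, and a single outlier can squash the rest of the values toward zero.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values for illustration
train = pd.DataFrame({'age': [0.0, 40.0, 80.0], 'fare': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'age': [100.0], 'fare': [15.0]})

scaler = MinMaxScaler()
scaler.fit(train)  # learns each column's min and max from train only

train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled)
print(test_scaled)  # values outside the train range can land outside [0, 1]
```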

In [2]:
df = acquire.get_titanic_data()

In [None]:
df.head()

In [None]:
df.embark_town = df.embark_town.fillna(df.embark_town.value_counts().head(1).index[0])

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
train, test = sklearn.model_selection.train_test_split(df, train_size=.8, random_state=19)

In [None]:
train.head()

In [None]:
# Handle the missing values in the embark_town and embarked columns.
train.embark_town.value_counts()

In [None]:
train.embark_town = train.embark_town.fillna('Southampton')
test.embark_town = test.embark_town.fillna('Southampton')

In [None]:
train.isnull().sum()

In [None]:
pd.crosstab(train.embarked, train.embark_town)

In [None]:
train.embarked = train.embarked.fillna('S')
test.embarked = test.embarked.fillna('S')

In [None]:
test.isna().sum()

In [None]:
test.embark_town.value_counts()

In [None]:
# Remove the deck column.

train = train.drop(columns='deck')

In [None]:
test = test.drop(columns='deck')

In [None]:
train.head()

In [None]:
# Encode the embarked column (one-hot encoded here, rather than with the label encoder the exercise names).

train['embarked'] = train['embarked'].astype(str)
test['embarked'] = test['embarked'].astype(str)
# make the thing
encoder = sklearn.preprocessing.OneHotEncoder()

# fit the thing
encoder.fit(train[['embarked']])

cols = ['embarked_' + c for c in encoder.categories_[0]]

#transform the thing
# .todense() converts the sparse matrix to a dense 2-D numpy matrix
m = encoder.transform(train[['embarked']]).todense()

train = pd.concat([
    train,
    pd.DataFrame(m, columns=cols, index=train.index)
], axis=1)

m = encoder.transform(test[['embarked']]).todense()

test = pd.concat([
    test,
    pd.DataFrame(m, columns=cols, index=test.index)
], axis=1)

In [None]:
train.head()

In [None]:
def drop_columns(df):
    return df.drop(columns="deck")

def fill_na(df):
    df.embark_town = df.embark_town.fillna('Southampton')
    df.embarked = df.embarked.fillna('S')
    return df
    
def encode_titanic(train, test):
    encoder = sklearn.preprocessing.OneHotEncoder()
    encoder.fit(train[["embarked"]])

    m = encoder.transform(train[["embarked"]]).todense()

    train = pd.concat([train, pd.DataFrame(m, columns=encoder.categories_[0], index=train.index)], axis=1)

    m = encoder.transform(test[["embarked"]]).todense()

    test = pd.concat([test, pd.DataFrame(m, columns=encoder.categories_[0], index=test.index)], axis=1)

    return train, test

def impute_titanic(train, test):
    imputer = sklearn.impute.SimpleImputer(strategy='mean')
    imputer.fit(train[['age']])
    train.age = imputer.transform(train[['age']])
    test.age = imputer.transform(test[['age']])

    return train, test

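def scale_titanic(train, test):
    cols = ['age', 'fare']
    # Fit the min-max scaler on train only, then apply it to both splits
    scaler = sklearn.preprocessing.MinMaxScaler()
    scaler.fit(train[cols])
    train_scaled = pd.DataFrame(scaler.transform(train[cols]), columns=cols, index=train.index)
    test_scaled = pd.DataFrame(scaler.transform(test[cols]), columns=cols, index=test.index)
    return scaler, train_scaled, test_scaled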

In [None]:
df = drop_columns(df)

In [None]:
df = fill_na(df)

In [None]:
train, test = sklearn.model_selection.train_test_split(df, random_state=19, train_size = .8)

In [None]:
train, test = encode_titanic(train, test)

In [4]:
scaler, train, test = prepare.prep_titanic(df)

In [5]:
train.head()

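| Unnamed: 0 | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | embark_town | alone | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 263 | 263 | 0 | 1 | male | 0.537918 | 0 | 0 | 0.0 | S | First | Southampton | 1 | 0.0 | 0.0 | 1.0 |
| 395 | 395 | 0 | 3 | male | 0.293286 | 0 | 0 | 0.015216 | S | Third | Southampton | 1 | 0.0 | 0.0 | 1.0 |
| 595 | 595 | 0 | 3 | male | 0.483555 | 1 | 1 | 0.047138 | S | Third | Southampton | 0 | 0.0 | 0.0 | 1.0 |
| 94 | 94 | 0 | 3 | male | 0.79614 | 0 | 0 | 0.014151 | S | Third | Southampton | 1 | 0.0 | 0.0 | 1.0 |
| 887 | 887 | 1 | 1 | female | 0.252514 | 0 | 0 | 0.058556 | S | First | Southampton | 1 | 0.0 | 0.0 | 1.0 |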
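The `prepare.prep_titanic` function called above lives in prepare.py and is not shown in this notebook. The sketch below is one way the steps from this section might be chained to produce the same `(scaler, train, test)` return shape. The name `prep_titanic_sketch`, the toy data, and the exact order of steps are assumptions, not the contents of prepare.py.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def prep_titanic_sketch(df, random_state=19):
    """Hypothetical stand-in for prepare.prep_titanic: returns (scaler, train, test)."""
    # Drop deck and fill missing embarkation values with the most common port
    df = df.drop(columns='deck')
    df = df.fillna({'embark_town': 'Southampton', 'embarked': 'S'})

    train, test = train_test_split(df, train_size=.8, random_state=random_state)

    # One-hot encode embarked, fitting on train only
    encoder = OneHotEncoder(handle_unknown='ignore').fit(train[['embarked']])
    cols = list(encoder.categories_[0])

    def add_dummies(split):
        m = encoder.transform(split[['embarked']]).toarray()
        return pd.concat([split, pd.DataFrame(m, columns=cols, index=split.index)], axis=1)

    train, test = add_dummies(train), add_dummies(test)

    # Impute missing ages with the mean age of the train split
    imputer = SimpleImputer(strategy='mean').fit(train[['age']])
    train['age'] = imputer.transform(train[['age']]).ravel()
    test['age'] = imputer.transform(test[['age']]).ravel()

    # Min-max scale age and fare in place, again fitting on train only
    scaler = MinMaxScaler().fit(train[['age', 'fare']])
    train[['age', 'fare']] = scaler.transform(train[['age', 'fare']])
    test[['age', 'fare']] = scaler.transform(test[['age', 'fare']])
    return scaler, train, test


# Invented rows that mimic the relevant titanic columns
toy = pd.DataFrame({
    'age': [22.0, None, 38.0, 26.0, 35.0, None, 54.0, 2.0, 27.0, 14.0],
    'fare': [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    'embarked': ['S', 'C', 'S', None, 'S', 'Q', 'S', 'S', 'S', 'C'],
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton', None, 'Southampton',
                    'Queenstown', 'Southampton', 'Southampton', 'Southampton', 'Cherbourg'],
    'deck': [None, 'C', None, 'C', None, None, 'E', None, None, None],
})
toy_scaler, toy_train, toy_test = prep_titanic_sketch(toy)
print(toy_train.head())
```

Fitting the encoder, imputer, and scaler on the train split alone keeps information from the test split out of the transformations.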
