# Data science and Python in Jupyter
### A bootcamp tutorial on using Python in Jupyter notebooks for Data Science

In this notebook, we work through a data science task on cleaned data using pandas, a popular library for analyzing and manipulating tabular data.  The data that we will use is provided here and is also included in the repository for convenience.

As in R, we start by importing the packages that we need.  We will use pandas, which provides functionality for manipulating tabular data and other data types.

In [None]:
#import statements
import pandas as pd

In [None]:
#magics
%matplotlib inline

## Loading and viewing the data

In [None]:
#Strings in the file that should be read as missing values
na_list = ['?']

In [None]:
#Load data
filename = 'data/adult_data.csv'
df = pd.read_csv(filename, skipinitialspace=True, na_values=na_list)
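The effect of `skipinitialspace` and `na_values` is easiest to see on a tiny example. The sketch below uses a made-up three-row CSV held in memory (the rows and the names `raw`, `plain`, and `cleaned` are illustrative assumptions, not part of the dataset file) and compares a default read with the options used above.

```python
import io
import pandas as pd

# Hypothetical miniature CSV with a space after each comma and '?' for a missing value
raw = "age, workclass, hours-per-week\n39, State-gov, 40\n50, ?, 13\n38, Private, 40\n"

# Default read: leading spaces stay in the headers and values, and ' ?' is an ordinary string
plain = pd.read_csv(io.StringIO(raw))

# Read with the same options as above: spaces are stripped and '?' becomes NaN
cleaned = pd.read_csv(io.StringIO(raw), skipinitialspace=True, na_values=['?'])

print(list(plain.columns))
print(list(cleaned.columns))
print(cleaned['workclass'].isna().sum())
```

Without these options the missing entries would not be counted by `isna()`, so they would survive any later call to `dropna()`.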

In [None]:
# Get a preview of the data
df.head()

In [None]:
#show_counts replaces the older null_counts argument (removed in pandas 2.0)
df.info(show_counts=True) #But what does pandas consider to be null/na?

In [None]:
df.info(verbose=False)

In [None]:
df.describe()

In [None]:
#Count na values and/or replace them with other things
df.isna().sum()

In [None]:
df.columns

In [None]:
# Look at the value counts for all of the columns: categorical histogram
for x in df.columns:
    if df[x].dtype=='object':
        print('Col name: ', x, '\n', df[x].value_counts(), '\n')

## Selecting data from the dataframe

In [None]:
#Select certain columns by name
occupation_info = df['occupation'] #shorthand
occupation_info = df.loc[:, 'occupation'] #using loc operator
occupation_info = df.loc[:, ['occupation', 'marital-status', 'relationship']] #select multiple columns using a list
occupation_info = df.iloc[:, [3,5,7]] #select columns using their integer location

In [None]:
#Select certain rows
first_n_rows = df.loc[0:10, :] #select rows by index LABEL (the end label, 10, is included)
first_n_rows = df.iloc[0:10, :] #select rows by integer location
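The two selections above do not return the same rows in general. The following sketch, on a hypothetical 20-row frame named `toy`, shows that `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. The difference becomes more visible once rows have been filtered out and the index has gaps.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..19
toy = pd.DataFrame({'value': range(20)})

by_label = toy.loc[0:10, :]      # labels 0 through 10, inclusive
by_position = toy.iloc[0:10, :]  # positions 0 through 9
print(len(by_label), len(by_position))  # 11 10

# After filtering, the index keeps its original labels (0, 2, 4, ..., 18)
filtered = toy[toy['value'] % 2 == 0]
print(len(filtered.loc[0:10]), len(filtered.iloc[0:10]))  # 6 10
```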

In [None]:
#Select rows according to some criteria
nm_bools = df.loc[:, 'marital-status'] != 'Never-married'
n_nm_df = df.loc[nm_bools, :]
n_nm_df.head()

In [None]:
#Select and view data for all non-'Private' workclasses
n_np_df = df.loc[ df.loc[:, 'workclass'] != 'Private']
n_np_df.head()

## Handling missing values

In [None]:
#Simplest approach if the dataset is large enough: drop the rows with missing values
test = df.dropna(axis=0)
print(test.shape)
df.dropna(axis=0, inplace=True)
df.reset_index(inplace=True, drop=True)
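Dropping rows is not the only option. As a minimal sketch on a made-up frame (the values and the placeholder category `'Unknown'` are assumptions for illustration), the code below contrasts dropping rows with filling the missing entries, and shows why the index is reset after dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing workclass entries
toy = pd.DataFrame({'workclass': ['Private', np.nan, 'State-gov', np.nan],
                    'age': [39, 50, 38, 53]})

# Option 1: drop incomplete rows, then renumber the index so it has no gaps
dropped = toy.dropna(axis=0).reset_index(drop=True)

# Option 2: keep every row and replace missing categories with a placeholder
filled = toy.fillna({'workclass': 'Unknown'})

print(dropped)
print(filled)
```

Resetting the index matters later in this notebook, because frames built from encoder output get a fresh 0-based index and are concatenated by index.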

In [None]:
df.isna().sum()

In [None]:
df['salary-class'].value_counts()

In [None]:
#You can also plot directly using Pandas.  Pandas plots are based on the matplotlib library.
df['education-num'].hist(bins=15)

In [None]:
??pd.Series.hist

## Modeling the data (Alt 1)
In this step, we'll model the data using some columns as predictors and one column as the response.  For the predictors, we will use \[age, hours-per-week, workclass\], and for the response, we will use salary-class.  Some of these variables will require some preprocessing to get them into a form suitable for a classifier.

In [None]:
#imports
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

In [None]:
#Divide data into training and testing sets.  Splitting before any preprocessing keeps information from the testing set out of the encoders, scaler, and model.

#Split data into relevant portions
data_y = df['salary-class']
data_x = df[['age', 'hours-per-week', 'workclass']]

ptrain_x, ptest_x, ptrain_y, ptest_y = train_test_split(data_x, data_y, train_size=0.85)

### Encodings for predictors

In [None]:
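# One hot encode workclass; note that if we wanted to, we could one hot encode all of the categorical columns
# sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
wc_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
wc_1h = wc_encoder.fit_transform(ptrain_x['workclass'].to_frame())
# get_feature_names_out replaces the removed get_feature_names
wc_df = pd.DataFrame(wc_1h, columns=wc_encoder.get_feature_names_out())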

In [None]:
print(wc_df.shape)
wc_df.head()
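The `handle_unknown='ignore'` setting matters when the test set contains a category that the encoder never saw during fitting. The sketch below uses two small made-up frames (`toy_train` and `toy_test` are hypothetical names) to show that such a category is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'Never-worked' appears only at test time
toy_train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
toy_test = pd.DataFrame({'workclass': ['State-gov', 'Never-worked']})

toy_encoder = OneHotEncoder(handle_unknown='ignore')
toy_encoder.fit(toy_train)

# The default output is a sparse matrix; convert it to a dense array for display
encoded = toy_encoder.transform(toy_test).toarray()

print(toy_encoder.categories_)
print(encoded)
```

The first test row gets a 1 in the `State-gov` column, and the unseen category yields all zeros.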

In [None]:
#Scale inputs to logistic regression
num_scaler = StandardScaler()
ns = num_scaler.fit_transform(ptrain_x[['age','hours-per-week']])
ns_df = pd.DataFrame(ns, columns=['age', 'hours-per-week'])

In [None]:
print(ns_df.shape)
ns_df.head()
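The scaler learns its mean and standard deviation from the training data only, and the same values are reused for the test data. A minimal sketch with made-up ages (the arrays below are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column training and test data
toy_train = np.array([[20.0], [30.0], [40.0]])
toy_test = np.array([[30.0], [50.0]])

# fit() stores the training mean and standard deviation
toy_scaler = StandardScaler().fit(toy_train)

# transform() applies the stored statistics; nothing is re-estimated from the test data
scaled_test = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(scaled_test)
```

Calling `fit_transform` on the test set instead would scale it with its own statistics, so the two sets would no longer be on a common scale.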

### Encodings for responses

In [None]:
#binary encode label
sc_encoder = LabelEncoder()
sc = sc_encoder.fit_transform(ptrain_y)
train_y = pd.Series(sc, name='salary-class')

In [None]:
train_y.head()
print(train_y.shape)
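To see which integer corresponds to which class, inspect the encoder's `classes_` attribute. The sketch below assumes the two label strings `'<=50K'` and `'>50K'`; check `df['salary-class'].value_counts()` for the actual values in the data.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings, for illustration only
toy_labels = ['<=50K', '>50K', '<=50K']

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted; the position of each class is its integer code
print(list(toy_encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original labels
print(list(toy_encoder.inverse_transform([1, 0])))
```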

In [None]:
#Concatenate all of the features together
#feat_df = df[['age', 'capital-gain', 'capital-loss', 'hours-per-week']].copy()
train_x = pd.concat([wc_df, ns_df], sort=False, axis=1)

In [None]:
print(train_x.shape)
train_x.head()

In [None]:
print(train_y.shape)
train_y.head()

### Modeling and Prediction

In [None]:
#Create a classifier (via logistic regression)
lr_classifier = LogisticRegressionCV(class_weight='balanced', solver='lbfgs', max_iter = 1000, cv=5)

In [None]:
#Train classifier via kfold cross validation
lr_classifier.fit(train_x, train_y)

In [None]:
#Use the encoders fitted on the training set to create the test set
test_x = pd.concat( [pd.DataFrame(wc_encoder.transform(ptest_x['workclass'].to_frame()),
                                 columns=wc_encoder.get_feature_names_out()),
                     pd.DataFrame(num_scaler.transform(ptest_x[['age', 'hours-per-week']]),
                                  columns = ['age', 'hours-per-week'])], axis=1)
test_y = pd.Series(sc_encoder.transform(ptest_y), name='salary-class')

In [None]:
print(test_x.shape)
test_x.head()

In [None]:
print(test_y.shape)
test_y.head()

In [None]:
#Test the classifier on the held out test set
pred_y = lr_classifier.predict(test_x)

In [None]:
#Can use a classification report to get other metrics:
print("Classification report: \n", classification_report(test_y, pred_y))

In [None]:
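#Investigate popular single-number metrics
#ROC AUC is computed from the predicted probability of the positive class rather than from hard labels
roc_auc_score(test_y, lr_classifier.predict_proba(test_x)[:, 1])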

Often, visualization provides a more intuitive understanding of the results.  For a clearer view of the confusion matrix, we can use the matplotlib and seaborn packages.

In [None]:
#Look at the confusion matrix of the result of the testing set
conf_mat = confusion_matrix(test_y,pred_y)
conf_mat_ratio = conf_mat/(pred_y.shape[0])
print("Confusion matrix: \n", conf_mat)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn

plt.figure(figsize=(10,4))

plt.subplot(1,2,1) 
ax = sn.heatmap(pd.DataFrame(conf_mat, index=sc_encoder.classes_, columns=sc_encoder.classes_),
          annot=True, fmt = "d", annot_kws={"size": 14}, cbar=False)
ax.set_xlabel('Predicted class', fontsize=16);
ax.set_ylabel('Actual class', fontsize=16);
ax.set_title('Salary-class Confusion: Counts')

plt.subplot(1,2,2) 
ax = sn.heatmap(pd.DataFrame(conf_mat_ratio, index=sc_encoder.classes_, columns=sc_encoder.classes_),
          annot=True, fmt = "0.2f", annot_kws={"size": 14}, cbar=False)
ax.set_xlabel('Predicted class', fontsize=16);
ax.set_ylabel('Actual class', fontsize=16);
plt.subplots_adjust(wspace=0.4)
ax.set_title('Salary-class Confusion: Ratio');

## Modeling the Data (Alt 2)
One especially useful feature of sklearn is its pipelines.  A pipeline defines the sequence of steps that is applied to any data passed to the model, so the same preprocessing is used for both training and prediction.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
#Define parameters of the pipeline
num_preds = ['age', 'hours-per-week']
cat_preds = ['workclass']
res_var = ['salary-class']

#Define steps to be done for each type of variable defined
num_steps = [('scaler', StandardScaler())]
cat_steps = [('encoder', OneHotEncoder(handle_unknown='ignore'))]
res_steps = [('encoder', LabelEncoder())] #not used below; the response is encoded separately

#Transformers
num_xformer = Pipeline(num_steps)
cat_xformer = Pipeline(cat_steps)

#Define preprocessing steps
preprocessor = ColumnTransformer(transformers=[('num', num_xformer, num_preds),
                                              ('cat', cat_xformer, cat_preds)])

#Create full pipeline steps
lr_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                               ('classifier', LogisticRegressionCV(class_weight='balanced', solver='lbfgs',
                                                                   max_iter =1000, cv=5))])

In [None]:
#Split data into relevant portions
data_y = df['salary-class']
data_x = df[['age', 'hours-per-week', 'workclass']]

ptrain_x, ptest_x, ptrain_y, ptest_y = train_test_split(data_x, data_y, train_size=0.85)

In [None]:
#Perform encoding on train_y and test_y
res_encoder = LabelEncoder()
train_y = res_encoder.fit_transform(ptrain_y)
test_y = res_encoder.transform(ptest_y)
res_encoder.classes_

In [None]:
#Fit the pipeline on the training data
lr_pipeline.fit(ptrain_x, train_y)

In [None]:
y_pred = lr_pipeline.predict(ptest_x)

In [None]:
print(classification_report(test_y, y_pred))

In [None]:
#Look at confusion matrix
conf_mat = confusion_matrix(test_y, y_pred)
conf_mat_ratio = conf_mat/(y_pred.shape[0])
print("Confusion matrix: \n", conf_mat)
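A further benefit of a pipeline is that it can be passed to `cross_val_score`, so the scaler and encoder are refit inside each fold. The sketch below is self-contained and runs on synthetic data: the frame `toy_x`, the rule used to generate `toy_y`, and the plain `LogisticRegression` classifier are all assumptions for illustration, not the model fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data
rng = np.random.default_rng(0)
n = 200
toy_x = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'hours-per-week': rng.integers(10, 60, size=n),
    'workclass': rng.choice(['Private', 'State-gov', 'Self-emp'], size=n),
})
# Made-up response: 1 when more than 40 hours are worked per week
toy_y = (toy_x['hours-per-week'] > 40).astype(int)

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'hours-per-week']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
])
toy_pipeline = Pipeline(steps=[('preprocessor', toy_preprocessor),
                               ('classifier', LogisticRegression(max_iter=1000))])

# Each of the 5 folds refits the preprocessing and the classifier on its own training part
scores = cross_val_score(toy_pipeline, toy_x, toy_y, cv=5)
print(scores)
print(scores.mean())
```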