# Dealing with Data Spring 2022 – Class 8

_For today's class, don't worry too much about following along with the code. Rather, consider this an exercise in better understanding what the skills you are learning in this class look like when applied to a real-world scenario._

---------------

# Step 1: Explore our Data

First, we will change some of our Pandas settings so that we display more row and column values. You can find more setting options [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).

In [None]:
import pandas as pd

pd.options.display.max_rows = 5000
pd.options.display.max_columns = 100 

Next, we'll read in our CSV using the `pd.read_csv()` function.

In [None]:
data = pd.read_csv('propensity_data.csv') # we are reading in our csv and assigning it to the variable 'data'.
                                          # you can, of course, name this whatever you'd like!

# fun fact, you can use the 'Tab' button to auto-complete a file name!

In [None]:
data.head() # '.head()' will give you the first 5 rows. '.tail()' will give you the last 5 rows

In [None]:
list(data) # let's get a list of all the column names in our file

> *pageID:* A random ID assigned to the instance.

> *paywall:* What paywall experience the visitor got.

> *time:* What time the content was accessed.

> *daysSinceFirstSeen:* Days since we first saw the visitor.

> *section:* On what section of the site did the article view occur? 

> *visitNum:* Visit number.

> *pageNum:* Page number within the visit.

> *registered:* Whether or not the user is registered.

> *edu:* Whether or not the user is visiting from a '.edu' domain.

> *mobile:* Whether or not the user is on a mobile phone.

> *mac:* Whether or not the user is using an Apple device.

> *converted:* A binary value, '0' for did not convert, '1' for did convert.

In [None]:
data.info() # some basic info on our dataframe

In [None]:
data.count() # this is telling us the number of non-null values we have per column

In [None]:
data.sample() # a random sample row from our data frame

In [None]:
data.iloc[3] # .iloc is how we index into a data frame by position. In this case, we're asking for the fourth row
             # remember, Python is 0-indexed, so position 3 is actually the fourth row!

In [None]:
data.iloc[3,2] # [row, column] aka, the fourth row, third column (time)

Note: `loc` is used when you are selecting by label (e.g., a column name or an index label), whereas `iloc` is used when you are selecting by integer position.
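To make the `loc` versus `iloc` distinction concrete, here is a minimal sketch on a made-up data frame. The values and the row labels (10, 20, 30) are assumptions for illustration only and are not taken from `propensity_data.csv`.

```python
import pandas as pd

# Toy frame (hypothetical values) whose row labels differ from row positions
toy = pd.DataFrame(
    {"section": ["Funds", "Stock Tips", "Funds"], "converted": [0, 1, 0]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "converted"]  # row labelled 20, column named 'converted'
by_position = toy.iloc[1, 1]         # second row, second column (0-indexed)

print(by_label, by_position)
```

Both lookups land on the same cell here. They only differ in how you point at it: by name with `loc`, by position with `iloc`.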

In [None]:
data.sample(random_state=1) # random_state ensures that we all get the same values :) 

^ We can see that our user, #536022, was first seen 194 days ago, visited the Personal Finance section three times, and did not convert.

---

# ⭕ **QUESTIONS?**

---

# Step 2: Exploratory Analysis

## What is the average conversion rate for our sample users? 

In [None]:
avg_conversion = data['converted'].mean()

avg_conversion

In [None]:
# below we are just reformatting our value for ease of reading

percentage = '{0:.2f}'.format(avg_conversion) 
print(percentage + " = to two decimal places.")

percentage = '{0:.4f}'.format(avg_conversion)
print(percentage + " = to four decimal places.")

percentage = '{0:.6f}'.format(avg_conversion)
print(percentage + " = to six decimal places.")
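As an aside, the same formatting can be written with f-strings, and the `%` format code multiplies by 100 and appends a percent sign for you. This is a small sketch using a made-up rate (`example_rate` is hypothetical, not the value computed from our data).

```python
example_rate = 0.0019876  # hypothetical conversion rate, for illustration only

as_decimal = f"{example_rate:.4f}"  # four decimal places
as_percent = f"{example_rate:.2%}"  # shown as a percentage with two decimals

print(as_decimal)
print(as_percent)
```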

## What about the difference in conversion rate between locked and open paywall users? 

In [None]:
# select 'converted' before averaging so pandas doesn't try to average the text columns
conversion_by_paywall = data.groupby('paywall')['converted'].mean()
conversion_by_paywall
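If `groupby` feels abstract, this toy example may help. It is a sketch on four made-up rows (not our real data): split the rows by paywall type, then take the mean of the 0/1 `converted` column within each group, which is exactly that group's conversion rate.

```python
import pandas as pd

# Four hypothetical page views
toy = pd.DataFrame({
    "paywall": ["Open", "Open", "Locked", "Locked"],
    "converted": [1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the conversion rate
toy_rates = toy.groupby("paywall")["converted"].mean()
print(toy_rates)
```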

## What about the difference in conversion rate as it relates to visit numbers? 

In [None]:
conversion_by_visit = data.groupby('visitNum')['converted'].mean()
conversion_by_visit.sort_values(ascending=False)

In [None]:
data.loc[data['visitNum']==1345]

In [None]:
data.loc[data['visitNum']==642]

## What about the difference in conversion rate as it relates to section? 

In [None]:
conversion_by_section = data.groupby('section')['converted'].mean()
conversion_by_section

## How about some graphical insights?

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.heatmap(data.select_dtypes(include='number').corr(),annot=True) # correlations only make sense for numeric columns

---

# ⭕ **QUESTIONS?**

---

# Step 3: Feature Engineering

To use linear regression, every input must be numeric, but right now 'paywall' and 'section' are text. A computer doesn't know what Open versus Locked means, nor what Funds versus Personal Finance means. So, we create dummy variables: one 0/1 column per category!

In [None]:
dummy = pd.get_dummies(data['section'])
dummy.head()

In [None]:
data = pd.concat([data,dummy],axis=1) # axis=1 means we are adding by column, not the row
data.head()

In [None]:
dummy2 = pd.get_dummies(data['paywall'])
data = pd.concat([data,dummy2],axis=1)

data.head()
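Here is a small sketch of what dummy variables look like on a made-up column. The `prefix` and `dtype` arguments are optional extras that the cells above don't use: `prefix` makes the new column names easier to trace back to their source column, and `dtype=int` gives 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical three-row frame with one text column
toy = pd.DataFrame({"section": ["Funds", "Stock Tips", "Funds"]})

# One indicator column per distinct category
toy_dummies = pd.get_dummies(toy["section"], prefix="section", dtype=int)

# Attach the indicator columns next to the original column
toy = pd.concat([toy, toy_dummies], axis=1)
print(toy)
```

One thing to be aware of: the dummy columns for a single variable always sum to 1 across a row, so they are perfectly collinear. `pd.get_dummies` has a `drop_first=True` option that drops one category per variable if that matters for your model.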

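Lastly, using a raw number for the hour of the day visited is fine, but it would be more helpful if we could capture cyclical effects. For instance, 11 pm and 1 am are more similar than 11 pm and 5 pm, and the raw number wouldn't capture that relationship. Here we take a simpler route and turn the timestamp into two binary flags: weekday versus weekend, and morning versus the rest of the day.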

In [None]:
from datetime import datetime

data['time'] = pd.to_datetime(data['time'],errors='coerce')

_Working with datetime is notoriously frustrating. For documentation check [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)._

In [None]:
data['weekday'] = data['time'].dt.weekday # Monday = 0, Sunday = 6
data['weekday'] = (data['weekday'] < 5).astype(int) # 1 = Monday through Friday, 0 = weekend

data['morning'] = data['time'].dt.hour # 0 - 23
data['morning'] = (data['morning'] < 12).astype(int) # 1 = before noon, 0 = noon or later

data.head()
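The weekday and morning flags above are a coarse encoding. If you did want to capture the cyclical relationship between hours (11 pm being close to 1 am), one common option is to map the hour onto a circle with sine and cosine. This is a sketch of that alternative on three made-up hours, not something the rest of this notebook uses. The names `hour_sin` and `hour_cos` are just illustrative.

```python
import numpy as np
import pandas as pd

# Three hypothetical visit hours: 11 pm, 1 am, 5 pm
hours = pd.Series([23, 1, 17])

# Place each hour on a 24-hour clock face
angle = 2 * np.pi * hours / 24
encoded = pd.DataFrame({"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)})
points = encoded.to_numpy()

# Straight-line distance between encoded hours
dist_11pm_1am = float(np.linalg.norm(points[0] - points[1]))
dist_11pm_5pm = float(np.linalg.norm(points[0] - points[2]))

print(dist_11pm_1am, dist_11pm_5pm)
```

With this encoding, 11 pm ends up closer to 1 am than to 5 pm, which the raw numbers 23, 1, and 17 would not show.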

---

# ⭕ **QUESTIONS?**

---

# Step 4: Modeling

First, let's remind ourselves of the columns we have after creating our dummy variables, and separate the numeric features from the target.

In [None]:
## here are two ways to select numeric data and keep columns

# we are going to select only our numeric data, then all rows for all columns except the 'converted' and 'pageID' columns

numeric_data = data.select_dtypes(exclude=['object', 'datetime64'])
X1 = numeric_data.loc[:, ~numeric_data.columns.isin(['converted', 'pageID'])]
# the ~ flips the True/False mask, so we keep the columns that are NOT in the list

### OR more pythonically

keep_columns = [x for x in numeric_data.columns if x not in ['converted','pageID']]
X2 = numeric_data.loc[:, keep_columns]
X2.head()
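To see what the `~` is doing, here is a sketch on a tiny made-up frame. It also shows a third option, `drop(columns=...)`, which is not used in the cell above but gives the same result.

```python
import pandas as pd

# Hypothetical two-row frame with an ID, one feature, and the target
toy = pd.DataFrame({"pageID": [1, 2], "visitNum": [3, 4], "converted": [0, 1]})

mask = toy.columns.isin(["converted", "pageID"])  # True where the column is in the list
features = toy.loc[:, ~mask]                      # ~ flips the mask: keep the rest

# Equivalent: name the columns to drop directly
same_features = toy.drop(columns=["converted", "pageID"])

print(list(features.columns))
```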

In [None]:
data.head()

In [None]:
data.drop(['time'],axis=1,inplace=True) # inplace=True modifies 'data' directly instead of returning a new copy

In [None]:
data.head()

In [None]:
X = data[['daysSinceFirstSeen','visitNum','pageNum','registered','edu','mobile','mac','Funds','Personal Finance','Retirement Planning','Stock Tips','Locked','Open','weekday','morning']] 
y = data['converted']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .4, random_state = 101)
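A quick sketch of what `train_test_split` does, using ten made-up rows: with `test_size=.4`, 40% of the rows are held out for testing and the remaining 60% are used for training. The arrays below are placeholders, not our real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, plus a 0/1 target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=101
)

print(len(X_tr), len(X_te))
```

Fixing `random_state` means the same rows land in the test set every time the cell is run.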

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
regr = LinearRegression()

In [None]:
regr.fit(X_train,y_train)

In [None]:
y_pred = regr.predict(X_test)

In [None]:
print(y_pred)

In [None]:
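data2 = pd.DataFrame({'PredictedValue': y_pred}, index=X_test.index) # keep the test rows' index so predictions line up
data2

In [None]:
X_test = X_test.assign(predictions=y_pred) # .assign returns a new frame, which avoids a SettingWithCopyWarning

In [None]:
dataAll = pd.concat([data,data2],axis=1) # rows that were used for training will show NaN for PredictedValue
dataAll.head()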

In [None]:
coeff_df = pd.DataFrame(regr.coef_,X.columns,columns=['Coefficient'])

coeff_df

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE (Root Mean Squared Error) is the square root of the average squared residual (prediction error). When the residuals average out to zero, it is the same as their standard deviation. The residuals measure how far the data points are from the regression line. 

If the fit were perfect, the RMSE would be 0, because all the points would lie on the regression line and there would be no errors. 
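To make the definition concrete, here is RMSE computed by hand on four made-up values (these are not our model's predictions).

```python
import numpy as np

# Hypothetical actual outcomes and predictions
y_actual = np.array([0, 0, 1, 0])
y_guess = np.array([0.1, 0.0, 0.6, 0.1])

residuals = y_actual - y_guess                 # prediction errors
rmse_by_hand = np.sqrt(np.mean(residuals ** 2))  # root of the mean squared error

# A perfect set of predictions has no errors, so its RMSE is 0
rmse_perfect = np.sqrt(np.mean((y_actual - y_actual) ** 2))

print(rmse_by_hand, rmse_perfect)
```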

So, at first glance, we didn't do too badly with an RMSE of approximately .034!

But, let's try a logistic regression, which is more appropriate given that we're trying to do a binary classification (convert versus not convert).

In [None]:
from sklearn.linear_model import LogisticRegression 
from sklearn import metrics

logreg = LogisticRegression(solver='lbfgs')

# 'pageID' is left out, as in the linear model: it's a random ID, so it carries no signal
X = data[['daysSinceFirstSeen','visitNum','pageNum','registered','edu','mobile','mac','Funds','Personal Finance','Retirement Planning','Stock Tips','Locked','Open','weekday','morning']] 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .4, random_state = 101)

In [None]:
#print("X_train_info" + str(X_train.describe()))
#print("X_test_info" + str(X_test.describe()))
#print("y_train_info" + str(y_train.describe()))
#print("y_test_info" + str(y_test.describe()))

In [None]:
logreg.fit(X_train, y_train, sample_weight=None)

y_pred = logreg.predict(X_test)

In [None]:
score = logreg.score(X_test,y_test)
print(score)

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred) # compare the actual labels against the model's predictions
print(cm)

In [None]:
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);

Does this seem too good to be true? Because it is! Basically, because the average conversion rate is so low, the model just 'assumed' the default 'not converted' for most samples, which would be correct 99.8% of the time. 
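Here is a sketch of that effect using made-up labels, assuming 2 conversions out of 1,000 samples to mirror the 99.8% figure. A "model" that always predicts 'not converted' gets a great accuracy score while catching none of the actual conversions, which is what recall measures.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 998 non-conversions and 2 conversions
y_actual = np.array([0] * 998 + [1] * 2)

# A "model" that predicts 'not converted' for everyone
y_always_zero = np.zeros_like(y_actual)

toy_accuracy = accuracy_score(y_actual, y_always_zero)
toy_recall = recall_score(y_actual, y_always_zero)  # share of real conversions we caught
toy_cm = confusion_matrix(y_actual, y_always_zero)

print(toy_accuracy, toy_recall)
print(toy_cm)
```

The second column of the confusion matrix is all zeros because nothing was ever predicted as a conversion. That is a pattern worth looking for in the heatmap above.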

---

# ⭕ **QUESTIONS?**

---

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

classifiers = [
    KNeighborsClassifier(3),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()]

# Logging for Visual Comparison
log_cols=["Classifier", "Accuracy", "Log Loss"]
log_rows = [] # collect one row per classifier (DataFrame.append was removed in pandas 2.0)

for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    
    print("="*30)
    print(name)
    
    print('****Results****')
    test_predictions = clf.predict(X_test) # predicted classes on the held-out test set
    acc = accuracy_score(y_test, test_predictions)
    print("Accuracy: {:.4%}".format(acc))
    
    test_probabilities = clf.predict_proba(X_test) # predicted class probabilities
    ll = log_loss(y_test, test_probabilities)
    print("Log Loss: {}".format(ll))
    
    log_rows.append([name, acc*100, ll])
    
log = pd.DataFrame(log_rows, columns=log_cols)
print("="*30)
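Why log both accuracy and log loss? Accuracy only looks at the predicted class, while log loss looks at the predicted probabilities and penalizes confident mistakes heavily. In this sketch on four made-up labels, two hypothetical models make identical class predictions (always 'not converted'), so their accuracy is the same, but their log loss differs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_actual = np.array([0, 0, 0, 1])  # hypothetical labels: one conversion in four

def as_two_columns(p_convert):
    # Same layout as predict_proba: column 0 = P(not convert), column 1 = P(convert)
    p = np.full(len(y_actual), p_convert)
    return np.column_stack([1 - p, p])

hedged = as_two_columns(0.25)         # "each visitor has a 25% chance of converting"
overconfident = as_two_columns(0.01)  # "almost nobody converts"

# Both models predict class 0 for everyone once we threshold at 0.5
acc_hedged = accuracy_score(y_actual, hedged.argmax(axis=1))
acc_overconfident = accuracy_score(y_actual, overconfident.argmax(axis=1))

ll_hedged = log_loss(y_actual, hedged)
ll_overconfident = log_loss(y_actual, overconfident)

print(acc_hedged, acc_overconfident)
print(ll_hedged, ll_overconfident)
```

The overconfident model is punished for assigning just 1% probability to the one conversion that did happen, even though its accuracy matches the hedged model's.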

---

# ⭕ **QUESTIONS?**

---