# Homework 1

This homework contains two parts: Part 1 on Linear Regression and Part 2 on Logistic Regression. Both parts use real-world data and will introduce you to techniques used in the workforce! As a reminder, DO NOT edit anything in this Python notebook! All of your code will be contained in the functions located in answers.py. You should be able to complete this assignment with no extra imports, so please use what we have given you.

Some functions you may want to take note of before you start:
- Pandas .corr() function to calculate correlation
- Pandas .mean(), .median(), .std() functions
- Pandas mapping a lambda function, e.g. .map(lambda x: x)
- ' '.join(x) function where x is a list
- sklearn train_test_split() function
- Python strings .isalpha() function
- sklearn's confusion_matrix() function
- Pandas .plot() function
- statsmodels .summary() function
- numpy .linspace() function
- scipy norm.fit(), norm.pdf() functions
- statsmodels qqplot() function
- statsmodels .predict() function
- numpy random.normal() function

Make sure you have all of the packages installed. If you do not, run this command: "pip install pandas numpy scikit-learn matplotlib statsmodels nltk scipy"

In [None]:
# Import all of the necessary packages
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from numpy import random
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from scipy.stats import norm
import nltk
from statsmodels.graphics.gofplots import qqplot
from answers import *

## Linear Regression - Part 1
Part 1 of the homework will focus on applying Linear Regression to real-world data.
<br>
McDonald's is a global fast food chain that serves hamburgers in 119 different countries. McDonald's flagship hamburger is the Big Mac. Just like all other things in this world, the cost of the Big Mac fluctuates with the economy. However, the Big Mac is in a unique position because of its presence in over half of the world's countries. Through this exercise we will use linear regression to see if global markets data can be used to predict the cost of a Big Mac.

This is Part 1 of the coding assignment. It walks through cleaning the data we have provided and then training a linear regression model on that data to predict the dollar_price variable. The columns in this dataset we will be concerned with are local_price (the price of the Big Mac in the native currency), dollar_ex (the exchange rate of the dollar from that native currency), dollar_price (the converted dollar value of a Big Mac), GDP_dollar (the GDP per person in dollars at sample time), adj_price (the adjusted Big Mac price), and USD, EUR, GBP, JPY, and CNY (the values of various world currencies).

In [None]:
# Read in the country data as a pandas DataFrame
bmdf = pd.read_csv("big-mac-adjusted-index.csv")
bmdf

In [None]:
# Note the columns that are present, which ones we will be using as independent, and that the dependent column is dollar_price
indepCols = ['local_price', 'dollar_ex', 'GDP_dollar', 'adj_price', 'USD', 'EUR', 'GBP', 'JPY', 'CNY']
depCol = 'dollar_price'
bmdf.columns

In [None]:
# Take a look at some of the attributes of our DataFrame and become familiar with the data
bmdf.describe()

In [None]:
# 1.1 (2 pts.) 
# Edit the python function in answers.py to find all of the rows where dollar_price is zero and return that DataFrame
# RETURN: Modified DataFrame
findZeroDollarPrice(bmdf)

In [None]:
# 1.2 (1 pts.)
# It is invalid for a Big Mac to be free (dollar_price = zero), so we will replace those dollar prices with NaN (similar to
# null or None). Use the replace method to replace 0 in the dollar_price column with np.nan. We do this so the invalid values
# will not distort our mean, median, or standard deviation metrics when we later fill in these invalid values.
# RETURN: Modified DataFrame
bmdf = replaceZeroWithNaN(bmdf)
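
As a rough illustration of why zeros are swapped for NaN, the sketch below uses a small made-up Series rather than the Big Mac data. The values and the variable names `prices` and `cleaned` are assumptions chosen only for demonstration. pandas skips NaN by default when computing statistics, while a literal zero pulls the mean down.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the 0.0 entries stand in for invalid "free" Big Macs
prices = pd.Series([4.0, 0.0, 5.0, 0.0, 6.0])

# Replace the invalid zeros with NaN so pandas leaves them out of statistics
cleaned = prices.replace(0, np.nan)

print("mean with zeros:", prices.mean())
print("mean with NaN:  ", cleaned.mean())
print("NaN count:      ", cleaned.isna().sum())
```

The same `replace` call can be applied to a single DataFrame column, which is one possible way to approach this step.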

In [None]:
# Let's make a test copy of our DataFrame and test a few ways we could try to fix the dollar price NaN values.
# Note that pandas is smart, NaN will not be factored into any sort of mean, median, std calculations so we don't have to worry
# about this messing with our experiments.
bmdfTest = bmdf.copy()

In [None]:
# Here we define a function to print the Mean, Standard Deviation, and Median. Note the shortcuts you can use to calculate these
# values for a column. We will use this function to show how different ways of handling NaN values can influence our dataset.
# For example, this cell prints the metrics for dollar_price when NaN values are not counted at all.
def printMeanStdMedian(bmdfTest):
    print(f'mean: {bmdfTest["dollar_price"].mean()}, std: {bmdfTest["dollar_price"].std()}, median: {bmdfTest["dollar_price"].median()}')
printMeanStdMedian(bmdfTest)

In [None]:
# 1.3 (1 pts.) 
# Fill in the function replaceNaNWithZero that will replace the dollar_price NaN values with zero
# RETURN: Modified DataFrame
bmdfTest = replaceNaNWithZero(bmdfTest)
printMeanStdMedian(bmdfTest)
bmdfTest = bmdf.copy()

In [None]:
# 1.4 (1 pts.) 
# Fill in the function replaceNaNWithMean that will replace the dollar_price NaN values with the dollar_price mean
# RETURN: Modified DataFrame
bmdfTest = replaceNaNWithMean(bmdfTest)
printMeanStdMedian(bmdfTest)
bmdfTest = bmdf.copy()

In [None]:
# 1.5 (1 pts.) 
# Fill in the function replaceNaNWithMedian that will replace the dollar_price NaN values with the dollar_price median 
# RETURN: Modified DataFrame
bmdfTest = replaceNaNWithMedian(bmdfTest)
printMeanStdMedian(bmdfTest)
bmdfTest = bmdf.copy()

In [None]:
# 1.6 (2 pts.) 
# Fill in the function replaceNaNWithNormal that will replace the dollar_price NaN values with random samples from a normal
# distribution with the same mean and standard deviation as dollar_price
# (hint: use random.normal() to generate values to replace the NaN values)
# RETURN: Modified DataFrame
bmdfTest = replaceNaNWithNormal(bmdfTest)
printMeanStdMedian(bmdfTest)
bmdfTest = bmdf.copy()

In [None]:
# Let's say we have decided on filling our invalid values with the mean
bmdf = replaceNaNWithMean(bmdf)
printMeanStdMedian(bmdf)
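
To make the trade-offs between the filling strategies in 1.4 through 1.6 more concrete, here is a minimal sketch on a made-up Series (the data and the names `s`, `filled_mean`, `filled_median`, and `filled_normal` are hypothetical, not part of answers.py). It suggests that mean filling keeps the mean unchanged but shrinks the standard deviation, while sampling from a normal distribution tends to preserve more of the spread.

```python
import numpy as np
import pandas as pd
from numpy import random

# Hypothetical column with two missing values
s = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())

# Draw one value per missing entry from a normal distribution that
# matches the observed mean and standard deviation
random.seed(0)  # seeded only so the sketch is repeatable
draws = random.normal(s.mean(), s.std(), size=int(s.isna().sum()))
filled_normal = s.copy()
filled_normal[s.isna()] = draws

for name, col in [("original", s), ("mean", filled_mean),
                  ("median", filled_median), ("normal", filled_normal)]:
    print(f"{name:>8}: mean={col.mean():.3f}, std={col.std():.3f}")
```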

In [None]:
# 1.7 (3 pts.)
# To get a better idea of our data, we now want to graph scatter plots of each independent variable vs our 
# dependent variable dollar_price. Fill out the function graphIndepVsDep by graphing each independent variable against
# dollar_price. Set a title for each graph and label the x and y axis. 
# Make sure to display the plots below this cell
# RETURN: Nothing, display graphs
graphIndepVsDep(bmdf, indepCols)

In [None]:
# 1.8 (1 pts.) 
# Interesting... there seem to be two points in particular with a significantly higher (>60,000) local_price. Fill in the
# localPriceOutlier function to return the rows for those two points as a DataFrame.
# RETURN: DataFrame with outliers
localPriceOutlier(bmdf)

In [None]:
# 1.9 (1 pts.)
# There seems to be a similar occurrence with the dollar_ex variable. Fill out the dollarExOutliers function to output all
# the information about rows with a dollar_ex of more than 20,000
# RETURN: DataFrame with outliers
dollarExOutliers(bmdf)

In [None]:
# 1.10 (3 pts.) 
# Let's check the correlation between all of our numeric variables. Fill in the function correlationHeatmap that will calculate
# the correlation matrix of all our variables and then graph them as a heatmap. 
# Make sure to display the plot below this cell
# RETURN: Nothing, display a graph
correlationHeatmap(bmdf)

In [None]:
# 1.11 (3 pts.) 
# Now let's do some Linear Regression! Fill in the linearRegressionFit function using the ols and fit functions from statsmodels.
# Use GDP_dollar and USD as independent variables and dollar_price as the dependent variable.
# Print the model summary as part of the linearRegressionFit function. linearRegressionFit should return the fitted model.
# RETURN: Linear Regression model
model = linearRegressionFit(bmdf)

In [None]:
# Notice the actual weights assigned to our parameters
model.params

In [None]:
# Notice the p-values assigned to our parameters. They are low because both variables have a statistically significant effect in the model
model.pvalues

In [None]:
# We can even look at the rsquared value
model.rsquared
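
For readers new to the statsmodels formula API, the following sketch fits an ordinary least squares model on synthetic data and reads back the same attributes inspected above. Everything here (the `toy` DataFrame, the columns `x1`, `x2`, `y`, and the true coefficients) is an assumption made up for illustration, so the numbers say nothing about the Big Mac data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends linearly on x1 and x2 plus noise (hypothetical names)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 3.0 * toy["x2"] + rng.normal(scale=0.5, size=n)

# The formula names the dependent variable on the left of "~"
# and the independent variables on the right
toy_model = ols("y ~ x1 + x2", data=toy).fit()

print(toy_model.params)    # estimated Intercept, x1, and x2 weights
print(toy_model.pvalues)   # one p-value per parameter
print(toy_model.rsquared)  # fraction of variance explained
```

Because the data were generated from known coefficients, the estimated weights should land close to 1, 2, and -3, which is a handy sanity check when learning the API.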

In [None]:
# Now we will add the residuals to our bmdf DataFrame for further analysis
bmdf['residual']=model.resid
bmdf

In [None]:
# 1.12 (3 pts.) 
# We would like to see how our independent variables relate to the residuals, fill out the function graphIndepVsResidual
# to generate a scatter plot for each independent variable passed in where the residuals are on the y axis and the independent
# variable is on the x axis. Make sure to include a title as well as labels for the x and y axis. 
# Make sure to display the plots below this cell
# RETURN: Nothing, display a graph
graphIndepVsResidual(bmdf, ['GDP_dollar', 'USD'], model)

In [None]:
# 1.13 (5 pts.) 
# Fill in the function histOfResiduals to generate a histogram of the residuals and overlay a normal curve on top of this
# histogram. Use norm.fit, np.linspace and norm.pdf to generate the data for a normal curve and make sure to set density=True 
# when plotting the histogram. Be sure to title your plot as well as label the x and y axis
# Make sure to display the plot below this cell
# RETURN: Nothing, display a graph
histOfResiduals(bmdf)
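
The way norm.fit, np.linspace, and norm.pdf fit together can be sketched as follows. The "residuals" here are random draws standing in for real model residuals, and the names `resid`, `xs`, and `pdf` are hypothetical. This is one possible arrangement, not the required solution.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Stand-in "residuals": random draws, not the homework model's residuals
rng = np.random.default_rng(1)
resid = rng.normal(loc=0.0, scale=2.0, size=500)

# Estimate the mean and standard deviation of the best-fitting normal
mu, sigma = norm.fit(resid)

# Evenly spaced x values across the data range, and the curve height at each
xs = np.linspace(resid.min(), resid.max(), 200)
pdf = norm.pdf(xs, loc=mu, scale=sigma)

fig, ax = plt.subplots()
# density=True rescales the bars so their total area is 1, matching the pdf
ax.hist(resid, bins=30, density=True, alpha=0.6)
ax.plot(xs, pdf)
ax.set_title("Histogram of stand-in residuals with fitted normal curve")
ax.set_xlabel("Residual")
ax.set_ylabel("Density")
```

In a notebook the figure renders when the cell finishes. In a plain script you would likely need to call plt.show() at the end.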

In [None]:
# 1.14 (3 pts.) 
# Using the qqplot function imported from statsmodels graph the QQ Plot of the model residuals with a 45 degree line (you can
# add such a line in the arguments of the qqplot function, make sure to set fit=True)
# Make sure to display the plot below this cell
# RETURN: Nothing, display a graph
graphQQPlot(model)

In [None]:
# 1.15 (3 pts.) 
# Using the model, access the fitted values and residual values to fill in the graphFittedVsResidual function, which graphs a scatter
# plot of fitted values on the x axis vs residual values on the y axis. Make sure to add a title and label the x and y axis
# Make sure to display the plot below this cell
# RETURN: Nothing, display a graph
graphFittedVsResidual(model)

In [None]:
# 1.16 (2 pts.) 
# Use the model.predict function to fill out the predictLinearRegression function, which takes in the model and prints out the
# data that you are predicting on and the predicted value. 
# RETURN: Floating point predicted number
predictLinearRegression(model, 1000, 2)

## Logistic Regression - Part 2
In this section we will be predicting whether a movie review from IMDB is positive or negative using logistic regression! The IMDB movie review dataset is a popular sentiment analysis dataset that is used in many parts of machine learning. To keep this assignment manageable, we have provided a subset of 6,000 movie reviews as well as their associated labels positive (1) and negative (0) in imdbReviews.csv. These labels were made and validated by humans so we accept them as a ground truth. 

In [None]:
# Download the corpus
df = pd.read_csv('imdbReviews.csv')
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
# This cell will give you a random row from our DataFrame, feel free to run it as many times as you like to get a feel for 
# what is generally labeled as negative and what is considered positive.
sample = df.sample()
print(f'Score: {sample["label"].values[0]}\n\nText: {sample["text"].values[0]}')

In [None]:
# 2.1 (3 pts.)
# Now we need to edit the text so that it will be simpler for us to vectorize it and run logistic regression. The first
# modification we will make is converting all of the text to lowercase. Fill out the lowerCase function and return the
# DataFrame with all of the text in lower case. (Hint: the map function in pandas may be helpful)
# RETURN : the modified DataFrame, make all edits within the text column
df = lowerCase(df)
df

In [None]:
# 2.2 (3 pts.) 
# Next, we are going to tokenize our text so we can filter out unwanted pieces of the sentence. Fill out the tokenizeDF function
# so that all of the text is tokenized using word_tokenize(). (Hint: the map function in pandas may be helpful)
# RETURN : the modified DataFrame, make all edits within the text column
df = tokenizeDF(df)
df

In [None]:
# 2.3 (4 pts.) 
# Now, we are going to filter through the tokenized text and remove all of the stop words from each row.
# Fill out the removeStop function to do this and return the modified DataFrame. We also pass in a set of English
# stop words.
# RETURN : the modified DataFrame, make all edits within the text column
stopeng = set(stopwords.words('english'))
df = removeStop(df, stopeng)
df

In [None]:
# 2.4 (4 pts.) 
# Now, we are going to filter through the tokenized text and in each row keep only the words that are
# alphabetic, removing any tokens that are of size 0 or 1.
# Fill out the keepAlpha function to do this and return the modified DataFrame.
# RETURN : the modified DataFrame, make all edits within the text column
df = keepAlpha(df)
df

In [None]:
# 2.5 (3 pts.) 
# Lastly, we will join all of the tokenized words per row back into single strings. Fill out the joinText function to join
# your tokenized words together and return the new DataFrame
# RETURN : the modified DataFrame, make all edits within the text column 
df = joinText(df)
df
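
The five cleaning steps above (2.1 through 2.5) can be sketched end to end on two made-up reviews. To keep the sketch free of downloads, a regular expression stands in for word_tokenize and a tiny hand-written set stands in for the NLTK stop words. Both are assumptions for illustration, as are the names `toy` and `toy_stop`. The alphabetic filter uses str.isalpha(), which requires every character in the token to be a letter.

```python
import re
import pandas as pd

toy = pd.DataFrame({"text": ["This movie was GREAT, a 10 out of 10!",
                             "I did not like it."]})
# Tiny hypothetical stop list; the notebook uses NLTK's English list instead
toy_stop = {"this", "was", "a", "of", "i", "did", "it", "out"}

# 1. Lowercase every review
toy["text"] = toy["text"].map(lambda t: t.lower())
# 2. Tokenize: words and punctuation become separate tokens (stand-in for word_tokenize)
toy["text"] = toy["text"].map(lambda t: re.findall(r"\w+|[^\w\s]", t))
# 3. Drop stop words
toy["text"] = toy["text"].map(lambda ws: [w for w in ws if w not in toy_stop])
# 4. Keep alphabetic tokens longer than one character
toy["text"] = toy["text"].map(lambda ws: [w for w in ws if w.isalpha() and len(w) > 1])
# 5. Join the surviving tokens back into one string per row
toy["text"] = toy["text"].map(lambda ws: " ".join(ws))

print(toy["text"].tolist())
```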

In [None]:
# 2.6 (4 pts.)
# We need to vectorize our new text so that we can use it as input for the logistic regression. Fill in the function
# countVectorize to turn our DataFrame df into a matrix X which has the shape (num of examples, vocabulary size) and array
# y which has the shape (num of examples,). Use the CountVectorizer class to achieve this.
# RETURN: X, y, and vectorizer
X, y, vectorizer = countVectorize(df)
print(X)
print(y)

In [None]:
# 2.7 (3 pts.) 
# Fill in the function splitData using the train_test_split function to get a test size of .25 and set the random state to 42.
X_train, X_test, y_train, y_test = splitData(X, y)
print(f'XTrain length: {X_train.shape[0]}, XTest length: {X_test.shape[0]}, YTrain length: {len(y_train)}, YTest length: {len(y_test)}')

In [None]:
# Here we make sure our variables are all of the correct type.
y_test = y_test.astype(int)
y_train = y_train.astype(int)

X_test = X_test.astype(int)
X_train = X_train.astype(int)

In [None]:
# 2.8 (4 pts.) 
# Fill in the function trainLogisticRegression by creating a LogisticRegression object with a random state of 42 and then
# fitting that object to X_train and y_train. Then return the score when we try to predict on our test set. If you get a
# convergence warning, that is fine. We could modify some hyperparameters here, but it is generally not necessary for our
# simple task.
# RETURN: the model (clf), the accuracy as a decimal number for example 0.633
clf, accuracy = trainLogisticRegression(X_train, X_test, y_train, y_test)
print(accuracy)

In [None]:
# 2.9 (4 pts.)
# Fill in the scoreText function to vectorize any given text and return the prediction as well as the confidence in that
# prediction. You may want to go through the sklearn LogisticRegression documents found here:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# to come up with an answer.
# RETURN: predicted label (score), predicted probability of that label (confidence) both as decimal numbers
text = 'I love the work you are doing'
score, confidence = scoreText(clf, vectorizer, text)
print(f'Score: {score} with confidence: {confidence}')
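
A minimal sketch of scoring new text with a fitted vectorizer and classifier follows. The training sentences and the names `toy_vec`, `toy_clf`, and `new_X` are made up for illustration. The key point is that new text goes through transform, not fit_transform, so it is mapped onto the vocabulary learned during training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data: 1 = positive, 0 = negative
train_texts = ["great film loved it", "loved the acting great plot",
               "terrible film hated it", "hated the acting terrible plot"]
train_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression(random_state=42)
toy_clf.fit(toy_vec.fit_transform(train_texts), train_labels)

# Reuse the fitted vocabulary for unseen text
new_X = toy_vec.transform(["loved great"])

label = toy_clf.predict(new_X)[0]        # predicted class
probs = toy_clf.predict_proba(new_X)[0]  # one probability per class, in toy_clf.classes_ order
confidence = probs.max()                 # probability of the predicted class

print(f"label: {label}, confidence: {confidence:.3f}")
```

Taking the largest entry of predict_proba as the confidence is one reasonable reading of the task. It is a class probability, not a statistical confidence interval.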

In [None]:
# 2.10 (3 pts.) 
# Similar to countVectorize, use the TfidfVectorizer to vectorize our data using the TF-IDF method.
# RETURN: X, y (with the same shapes as in countVectorize), and vectorizer
X, y, vectorizer = tfidfVectorize(df)
print(X)
print(y)
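
To see how TF-IDF weights differ from raw counts, this sketch vectorizes the same three made-up documents both ways. The names `docs`, `counts`, and `tfidf` are hypothetical. With scikit-learn's default settings, words that appear in fewer documents receive larger weights, and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good movie", "bad movie", "good good plot"]

counts = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(counts)          # raw word counts
print(tfidf.round(2))  # TF-IDF weights, same shape as the counts

# Length of each row vector under the default settings
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```

In the second document, "bad" appears in only one document while "movie" appears in two, so "bad" ends up with the larger weight even though both occur once in that row.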

In [None]:
# Here we re-run the training process to see how TF-IDF compares to simple word counts. No modifications necessary.
X_train, X_test, y_train, y_test = splitData(X, y)
print(f'XTrain length: {X_train.shape[0]}, XTest length: {X_test.shape[0]}, YTrain length: {len(y_train)}, YTest length: {len(y_test)}')

clf, accuracy = trainLogisticRegression(X_train, X_test, y_train, y_test)
print(accuracy)

## Extra Space 
Make new cells after this point if you need extra space to answer analysis questions or think through problems.