### Downloading Python Libraries


In [None]:
# You can install things within a notebook by using the ! notation
!pip install pandas scikit-learn numpy nltk

In [2]:
import pandas as pd
import numpy as np

### Reading in the dataset file using Pandas
- This loads the data into a Pandas dataframe object. 
- Dataframe objects provide many functions which make data wrangling easier. 

In [3]:
df = pd.read_csv('Train.csv')

### Exploring the data
- Available features: ['reviewerID', 'amazon-id', 'helpful', 'unixReviewTime', 'reviewText', 'reviewTime', 'summary', 'price', 'categories', 'root-genre', 'title', 'artist', 'label', 'first-release-year', 'songs', 'salesRank','related']
- The target variable is 'overall', which is the album's star rating on Amazon. This variable will dictate what you are predicting. 
- reviewerID-- This is a number identifying a specific Amazon user. 
- amazon-id-- This is the id of the product (album). 
- helpful-- This is a fraction in the form of a list. For example, [2,3] means 2 out of 3 people found the review helpful. 
- unixReviewTime/reviewTime-- When the review was posted. 
- reviewText-- This is the full textual review of the album by a given user. **Important!**
- summary-- A summary of the review. 
- price-- The price of the album on Amazon. 
- categories-- Which genres the album is listed under on Amazon. 
- root-genre-- Primary genre of the album. 
- title-- Title of album (hashed)
- artist-- Artist name (hashed)
- first-release-year-- The year the album was released. 
- songs-- All of the id numbers for each song on the album. 
- salesRank-- The sales ranking of the album on Amazon. For example, the #1 album on Amazon will have value 1. 
- related-- This is the "users also bought" section on Amazon. It contains the Amazon ids of the related albums. 
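
The unixReviewTime feature is stored as an integer, which is hard to read directly. A minimal sketch of one way to turn it into a date, assuming the values are seconds since 1970-01-01 (the three values below are the ones in the first rows of the dataset; `toy`, `review_date`, and `review_year` are names made up for this illustration):

```python
import pandas as pd

# Toy frame holding the unixReviewTime values of the first three reviews
toy = pd.DataFrame({"unixReviewTime": [1302739200, 1180396800, 1361404800]})

# unit="s" tells pandas the integers are seconds since the Unix epoch
toy["review_date"] = pd.to_datetime(toy["unixReviewTime"], unit="s")

# Once the column is a datetime, parts such as the year are easy to extract
toy["review_year"] = toy["review_date"].dt.year
print(toy)
```

The resulting dates line up with the reviewTime strings for the same rows, so only one of the two columns is likely to be needed as a feature.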


In [4]:
# .head(3) simply returns the top-3 rows from a dataframe
df.head(3)

Unnamed: 0,reviewerID,amazon-id,helpful,unixReviewTime,reviewText,overall,reviewTime,summary,price,categories,root-genre,title,artist,label,first-release-year,songs,salesRank,related
0,-4984057859803657856,1877521326299865484,"[2, 2]",1302739200,Very nice music for practicing my Tai Chi. I d...,4,"04 14, 2011",Beautiful,16.47,"['CDs & Vinyl', 'New Age']",New Age,-3267874170410107454,-7180760356347753735,Cdbaby/Cdbaby,,"[7058439142327364074, 6037075874942075284, 852...",27222,"{'also_bought': [-404470919165672227, 11968160..."
1,9136764282801708742,1877521326299865484,"[11, 11]",1180396800,I recently starting doing Tai Chi which I love...,5,"05 29, 2007",Tranquillity In Motion !!!,16.47,"['CDs & Vinyl', 'New Age']",New Age,-3267874170410107454,-7180760356347753735,Cdbaby/Cdbaby,,"[7058439142327364074, 6037075874942075284, 852...",27222,"{'also_bought': [-404470919165672227, 11968160..."
2,2164551966908582519,1877521326299865484,"[0, 0]",1361404800,My wife uses it for her class room the kids lo...,5,"02 21, 2013",Great Stuff,16.47,"['CDs & Vinyl', 'New Age']",New Age,-3267874170410107454,-7180760356347753735,Cdbaby/Cdbaby,,"[7058439142327364074, 6037075874942075284, 852...",27222,"{'also_bought': [-404470919165672227, 11968160..."


### Accessing data using pandas 
##### Below are some of the basic pandas functions you will want to become familiar with. (Of course, you don't have to use pandas; I just believe it will make your life much easier.)
- df['column name'] allows you to return columns by name
- df.iloc[int] allows you to return rows by integer position (0 is the first row). 
- df[['column 1', 'column2']] allows you to return a subset of columns from the data frame. 
- df.shape returns a tuple with the number of (rows, columns) which in a ML context means (number of samples, number of features)
- df['new column'] = 1D array of equal length, makes a new column called 'new column'

Note: I call .head() on the examples below to save space and not put the full column in the notebook. 

In [5]:
df['overall'].head()

0    4
1    5
2    5
3    5
4    5
Name: overall, dtype: int64

In [6]:
df[['amazon-id', 'overall']].head()

Unnamed: 0,amazon-id,overall
0,1877521326299865484,4
1,1877521326299865484,5
2,1877521326299865484,5
3,1877521326299865484,5
4,1877521326299865484,5


In [7]:
df.iloc[0]

reviewerID                                         -4984057859803657856
amazon-id                                           1877521326299865484
helpful                                                          [2, 2]
unixReviewTime                                               1302739200
reviewText            Very nice music for practicing my Tai Chi. I d...
overall                                                               4
reviewTime                                                  04 14, 2011
summary                                                       Beautiful
price                                                             16.47
categories                                   ['CDs & Vinyl', 'New Age']
root-genre                                                      New Age
title                                              -3267874170410107454
artist                                             -7180760356347753735
label                                                     Cdbaby

In [8]:
print("Number of samples: {} Number of Features: {}".format(df.shape[0], df.shape[1]))

Number of samples: 111098 Number of Features: 18
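
The cells above cover column access, .iloc, and .shape, but not the last item in the list: creating a new column. A minimal sketch on a toy dataframe (the frame and the column names `is_expensive` and `price_rounded` are invented for illustration, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"price": [16.47, 15.00, 99.99], "overall": [4, 5, 4]})

# Assigning a list, array, or Series of matching length creates a new column
toy["is_expensive"] = toy["price"] > 50

# New columns can also be derived from existing ones
toy["price_rounded"] = toy["price"].round()

print(toy.shape)    # (rows, columns)
print(toy.iloc[0])  # first row, now including the two new columns
```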


### Pandas data cleaning example
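 - df['column'].isna() returns a boolean Series with one entry per row of the data frame. If df['column'].isna()[10] is True, that means row 10 has a null value in that column. (Calling df.isna() on the whole data frame returns a boolean data frame of the same shape instead.) 
     - df['column'].notna() works the same way, with True marking the non-null values. 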
 - If we pass a boolean array to df, it will return only the rows where values are True.
 
 Below, we extract only the rows that have first-release-year information. Rows where that value is NaN are discarded. 

In [9]:
print(df.shape)
print(df[df['first-release-year'].notna()].shape)
df[df['first-release-year'].notna()].head() # Returns rows where df['first-release-year'].notna() is True

(111098, 18)
(99826, 18)


Unnamed: 0,reviewerID,amazon-id,helpful,unixReviewTime,reviewText,overall,reviewTime,summary,price,categories,root-genre,title,artist,label,first-release-year,songs,salesRank,related
30,-3029154682982670675,2828769427501176858,"[13, 13]",1274486400,When I was in Music School at Stetson Universi...,5,"05 22, 2010",A Great Teaching tool,15.0,"['CDs & Vinyl', 'Classical', 'Sacred & Religio...",Classical,-3686601028207514605,-1412275221690118390,Solesmes,2002.0,"[-5278554366520980165, 1031059927497114547, -8...",171890,"{'also_bought': [-2564721328448647001, 4039811..."
31,-6276416742486014581,1609491003013328661,"[0, 0]",1357084800,"Fight like this, Walking Dead. Both songs are ...",5,"01 2, 2013",Decyfer Down at their best.,14.87,"['CDs & Vinyl', 'Rock', 'Hard Rock']",Rock,7453282935115361569,-3962525740789261387,SRE Records,2006.0,"[-723861579512861737, -4694740640801132236, 27...",543899,"{'also_bought': [6759112278599618175, -2155016..."
32,-84902388281572817,1683355681609577463,"[0, 1]",1395619200,After seeing this selling on line for as much ...,4,"03 24, 2014",Metallica - Death Magnetic (Deluxe Coffin Boxset),99.99,"['CDs & Vinyl', 'Rock']",Rock,2278261453371246336,1178076565539150448,Mercury (Universal),2008.0,"[6232524707980398149, 5566973718179316922, -43...",503636,"{'also_viewed': [-5109793234044987754, 5931663..."
33,-8209243714021876578,245546254853962888,"[1, 1]",1087430400,Let me start by saying that I am not a die-har...,5,"06 17, 2004",Best Comedy Recording Ever,36.97,"['CDs & Vinyl', 'Dance & Electronic']",Dance & Electronic,-3282866979189163366,8493232817338375900,Enigma,1990.0,"[-5143272649889793807, -547543979015767703, -8...",199821,"{'also_bought': [5334117667833916410, 27299936..."
34,-2614895023138152712,245546254853962888,"[0, 0]",1364774400,Been looking for this CD for a while. Descrip...,5,"04 1, 2013",Awesome,36.97,"['CDs & Vinyl', 'Dance & Electronic']",Dance & Electronic,-3282866979189163366,8493232817338375900,Enigma,1990.0,"[-5143272649889793807, -547543979015767703, -8...",199821,"{'also_bought': [5334117667833916410, 27299936..."


### Tip #1 for the project

- NaN values need to be dealt with, as most ML models cannot handle them. You will have to determine if 1) the entire column should be dropped, 2) only the rows with NaN values should be dropped, or 3) the NaN values should be filled in with some number extracted from the available data. 
    - For example, if a product has no price, you can 1) not use price information at all 2) get rid of that sample 3) somehow infer the price from other information? 
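
The three options can be sketched on a toy dataframe as follows. This is only an illustration: the data is made up, and filling with the median is one assumed choice among many.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [16.47, np.nan, 99.99, 15.00],
    "first-release-year": [np.nan, 2002.0, 2008.0, np.nan],
    "overall": [4, 5, 4, 5],
})

# Option 1: drop the entire column
no_price = toy.drop("price", axis=1)

# Option 2: drop only the rows where price is missing
rows_dropped = toy.dropna(subset=["price"])

# Option 3: fill missing prices with a statistic of the observed prices
filled = toy.copy()
filled["price"] = filled["price"].fillna(filled["price"].median())

print(no_price.shape, rows_dropped.shape, filled["price"].tolist())
```

If you go with option 3, it is usually safer to compute the fill value from the training split only, so that no information from the test data leaks into the features.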

### The Pandas .apply() function
In my opinion, this will be one of the most useful tools when dealing with pandas dataframes. apply() lets you apply a function to every cell of a column in a single call. Below is an example of how we can convert the 'helpful' feature into a numeric value.

Note, you can define your own function like 'convert_helpful' or, simply use a lambda function and have all your code live inside of the apply function. 

In [10]:
# literal_eval can take a string '[0,1]' and convert it to a list [0,1]
# Sometimes, saving list objects as csv's causes them to be represented as strings
# which is why we need to do this here. 
from ast import literal_eval

In [11]:
def convert_helpful(x):
    x = literal_eval(x)
    if x[1] == 0:
        return np.nan
    else:
        return x[0]/x[1]

# I use .iloc[45:50] here simply because these 5 samples
# were more diverse than samples 1-5
print(df['helpful'].iloc[45:50])
print(df['helpful'].apply(convert_helpful).iloc[45:50])
print(df['helpful'].apply(lambda x: np.nan if literal_eval(x)[1]== 0 else literal_eval(x)[0]/literal_eval(x)[1]).iloc[45:50])

45    [10, 11]
46      [4, 4]
47      [0, 0]
48    [13, 14]
49      [2, 3]
Name: helpful, dtype: object
45    0.909091
46    1.000000
47         NaN
48    0.928571
49    0.666667
Name: helpful, dtype: float64
45    0.909091
46    1.000000
47         NaN
48    0.928571
49    0.666667
Name: helpful, dtype: float64


## Structuring the project for Binary Classification
- Recall that we are using Amazon review data to predict album ratings. More specifically, we want to predict whether a product's average rating is over or under a certain threshold. For this problem, we say that an album is "awesome" if its average rating is over 4.5 (not inclusive). Otherwise, the product is "not awesome". The data is not given to us with a binary 1 (over) / 0 (under) column.
- Below, I will get you started on how to accomplish this with a pandas function called groupby. From here, you should be able to figure out how to make the target variable. 
- Many products in the training file will have more than one review. Thus, we need to find the average overall score for a product. 
- Groupby gathers the rows that share the same value in a column so that you can perform an operation on each group. Below, I show the mean review score for each product. You could also have called .std() for standard deviation or defined your own function.

Hint: Use .apply() to convert a groupby'd column to your target variable. 

In [12]:
avg = df[['amazon-id', 'overall']].groupby('amazon-id').mean()
avg.head()

Unnamed: 0_level_0,overall
amazon-id,Unnamed: 1_level_1
-9217723718720870868,4.333333
-9215746463819797371,5.0
-9213978596308513604,3.0
-9211290576571923870,4.5
-9208769561690910545,4.5
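
As a sketch of the other aggregations mentioned above, here is groupby on a toy dataframe with several built-in statistics and one custom function. The data and the "rating range" statistic are invented for illustration; building the actual target variable is still left to you.

```python
import pandas as pd

toy = pd.DataFrame({
    "amazon-id": [101, 101, 101, 202, 202, 303],
    "overall":   [5, 4, 5, 3, 5, 4],
})

# Select the column to aggregate after grouping by product id
grouped = toy.groupby("amazon-id")["overall"]

# Several built-in aggregations at once: one row per product
stats = grouped.agg(["mean", "std", "count"])

# A custom aggregation: spread between the best and worst rating
rating_range = grouped.agg(lambda s: s.max() - s.min())

print(stats)
print(rating_range)
```

Note that a product with a single review has no standard deviation, so its std comes back as NaN and needs the same care as any other missing value.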


## A final note on pandas 
- First of all, Pandas is my favorite python library and I love talking about it so feel free to ask me any questions. 
- There are a few other functions I found useful that I wanted to mention for you to look up on your own. 
    - df.drop([list of columns] or [list of indices], axis = 1 or 0) -> the drop function will be useful for getting rid of unwanted data. 
        - So for example, if you have the subset df[df['overall'] > 3], you can get its indices by calling .index and use these to drop those rows from the main df if you for some reason did not want ratings over 3. The same logic applies for columns. 
    - df.reset_index() will reset the dataframe index to start from 0 and count up (pass drop=True if you do not want the old index kept as a column). Sometimes when you process data the indices get mixed up and this can cause weird problems. Just remember that I said this. 
    - You can save a dataframe with df.to_csv('filename.csv'). Once your cleaning code works, it might save you some time to do the processing once, save the cleaned dataframe as a csv, and load that csv from then on. 
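
A minimal sketch of these three functions working together on a toy dataframe. The data is invented and the file is written to a temporary directory purely for illustration:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "amazon-id": [101, 101, 202, 303],
    "overall": [5, 2, 4, 3],
    "label": ["a", "b", "c", "d"],
})

# Drop a column by name (axis=1)
no_label = toy.drop(["label"], axis=1)

# Drop rows by index label (axis=0): remove every rating over 3
high_idx = toy[toy["overall"] > 3].index
low_only = toy.drop(high_idx, axis=0)
print(low_only.index.tolist())  # the surviving rows keep their old labels

# drop=True discards the old index instead of keeping it as a column
low_only = low_only.reset_index(drop=True)
print(low_only.index.tolist())

# index=False avoids writing the index as an extra column in the csv
path = os.path.join(tempfile.mkdtemp(), "toy_clean.csv")
low_only.to_csv(path, index=False)
reloaded = pd.read_csv(path)
print(reloaded)
```

Without index=False, reading the file back produces an extra column named "Unnamed: 0" that holds the old index.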

# Introduction to Scikit-Learn 
- Scikit-Learn is a machine learning library that will allow us to perform binary classification for our project. That is not to suggest that is all sklearn is capable of, as it is THE machine learning library. Take the time to get to know it: if you plan on working in ML, you will certainly be using Scikit-Learn at some point. 
- I am going to go over how to do classification on a toy dataset to give you an idea of how this API works. First, we will load the iris dataset, which ships with sklearn. 

In [13]:
from sklearn.datasets import load_iris # load a toy built-in dataset
iris = load_iris()
X = iris.data # get features
column_names = iris.feature_names
y = iris.target # get labels
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [14]:
df = pd.DataFrame(X, columns = column_names) # sklearn works nicely with pandas
df['target'] = y
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [15]:
df['target'].value_counts()

2    50
1    50
0    50
Name: target, dtype: int64

### Splitting our data into training and test sets. 
- In Machine Learning, we need to validate our models on unseen data to ensure that what the model has learned generalizes to new examples rather than simply memorizing the given training data. There is a handy function in sklearn called train_test_split which makes this process very simple. 
- The function takes in all of your data and labels, and then you use test_size to determine what fraction of your data should be reserved for testing (0.3 means 30%). 
    - Note: There is also a parameter called stratify that might be worth looking into for this project if you come across any class imbalance in your random train/test splits. 

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size = 0.3)
print('Full dataset shape: ', df.shape[0])
print('Train dataset shape: ', X_train.shape[0])
print('Test dataset shape: ', X_test.shape[0]) 

Full dataset shape:  150
Train dataset shape:  105
Test dataset shape:  45
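
The stratify parameter mentioned in the note can be sketched like this, assuming a made-up dataset where only 10% of the labels are 1. Passing the labels to stratify asks train_test_split to keep that proportion in both splits, and random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 90 samples of class 0 and 10 samples of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fraction of class 1 in each split
print(y_train.mean(), y_test.mean())
```

Without stratify, a random split of data this imbalanced can easily leave the test set with very few, or even zero, examples of the rare class.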


## Performing classification

- As you may have learned in class thus far, one way to perform classification is to use the Naive Bayes classification algorithm. The beautiful thing about sklearn is that it abstracts away the complicated details of ML algorithms and only demands that we properly prepare the data. 
- All sklearn prediction models follow the same design pattern:
    - model.fit(X_train, y_train)
    - predictions = model.predict(X_test)
    - accuracy = model.score(X_test, y_test)

In [17]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9555555555555556
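
The cell above only calls fit and score. As a self-contained sketch of the full fit / predict / score pattern, the snippet below reloads iris and fixes random_state (an assumption added here so the split is reproducible; the cell above does not set it, so its score can change from run to run).

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = GaussianNB().fit(X_train, y_train)

# predict() returns one predicted label per test sample
predictions = model.predict(X_test)

# For classifiers, score() is the mean accuracy of predict() on the given data,
# so these two numbers agree
print(model.score(X_test, y_test))
print(accuracy_score(y_test, predictions))
```

Keeping the predictions around, rather than only calling score, is what lets you compute other metrics later.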

### Tip #2 for the project. 
- Even though fitting and testing an sklearn model seems relatively easy, getting good results on more complex data (like Amazon reviews) will require knowledge of how the model works, as you will need to tune the hyperparameters accordingly. 
    - Make sure to understand how these classification algorithms work, read the documentation for them on sklearn's website, and look up what each of the hyperparameters does. Not only will this help your ML education but it will certainly help you get good results on the project. 
- Think about what evaluation metrics will best help you quantify how your model is performing. Accuracy is not always the best metric. 
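
To see why accuracy alone can mislead, here is a small sketch with hypothetical labels (1 = "awesome", 0 = "not awesome"). The labels and predictions are made up; the functions are from sklearn.metrics.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Hypothetical ground truth and predictions for 10 albums
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # F1 for the positive class (label 1)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc, f1)
print(cm)
print(classification_report(y_true, y_pred, target_names=["not awesome", "awesome"]))
```

Here the accuracy looks decent because most albums are in class 0, while the F1 score for class 1 shows that only half of the "awesome" albums were found and only half of the "awesome" predictions were right.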

### The hard part about this project
- You have seen me build a quick classification model on the iris dataset. Note that this was so easy because we were given good features to learn from! In the Amazon reviews challenge, you cannot simply pass the reviews to a Naive Bayes classifier, as NB has no clue what to do with raw text. 
- The main challenge will be to find meaningful numeric features for the text data. These text features, along with the other available information (price, artist, etc.) will then be used to predict the binary target variable for each album.
    - Please see the professor's slide deck from lecture one on NLP for tips on where to start with generating these features. 
    - I included pip install nltk at the beginning of this notebook because this library will help you get started with creating these features. It will be key for pre-processing the data. pandas, sklearn, and numpy should have everything else you need as you are not allowed to use neural methods for prediction. 
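
As one possible starting point (not necessarily the approach from the lecture slides), text can be turned into numeric features with TF-IDF from sklearn and then passed to a Naive Bayes model. The mini-corpus and labels below are made up, and the vectorizer settings are assumptions you would want to tune.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus; 1 = positive review, 0 = negative review
texts = [
    "great album, love every song",
    "beautiful music, great sound",
    "terrible album, waste of money",
    "boring songs and terrible sound",
]
labels = [1, 1, 0, 0]

# Lowercase the text, remove common English stop words, weight terms by TF-IDF
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)  # sparse matrix: one row per text, one column per term

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the terms that became features

# MultinomialNB accepts non-negative features such as counts or TF-IDF weights
model = MultinomialNB().fit(X, labels)

# New text must go through the SAME fitted vectorizer (transform, not fit_transform)
new = vectorizer.transform(["great music"])
print(model.predict(new))
```

On the real data you would fit the vectorizer on the training reviews only, and you would still need to decide how to combine many reviews into one row per album.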