# Overall:

1. Use OMDB to extract `['Runtime', 'Released', 'imdbVotes', 'imdbRating', 'Genre', 'Rated', 'Type']`

    - Runtime: numerical
    - Released: numerical, once the date string (e.g. `01 Feb 2009`) is parsed
    - imdbVotes: numerical, once converted from string
    - imdbRating: numerical
    - Genre: categorical, not rank based. Take in all categories and one-hot encode them.
    - Rated: categorical, rank based.
    - Type: categorical, one-hot encoded.

2. Perform TFIDF
3. Create a model of the text data using CountVectorizer()
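
The "rank based" encoding planned for `Rated` could look like the sketch below. The ordering in `RATING_ORDER` is an assumption made for illustration (roughly least to most restrictive, using rating labels that appear in the data); it is not taken from OMDB or from the analysis itself.

```python
import pandas as pd

# Hypothetical ordering, least to most restrictive. This is an assumption,
# not something defined by OMDB or by this notebook.
RATING_ORDER = ['TV-Y', 'TV-Y7', 'TV-G', 'TV-PG', 'TV-14', 'R', 'TV-MA']

def encode_rated(rated):
    """Map rating labels to integer ranks; unknown or missing labels become NaN."""
    rank = {label: i for i, label in enumerate(RATING_ORDER)}
    return rated.map(rank)

example = pd.Series(['TV-Y7', 'TV-14', None, 'R'])
encoded = encode_rated(example)
print(encoded)
```

Labels that are not in the list (for example `UNRATED`) come out as NaN, so they would still need to be handled separately.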


### Baseline (just CountVectorizer()):
`[ 0.63924843  0.61878914  0.6031746   0.5879649   0.61011283]`  
Mean Accuracy in Cross-Validation = 0.612

### Baseline + TFIDF:

Folds = 5  
`[ 0.66471816  0.65302714  0.63199666  0.61805265  0.63435019]`  
Mean Accuracy in Cross-Validation = 0.640

Folds = 10  
`[ 0.66444073  0.67195326  0.6566416   0.65413534  0.62238931  0.64160401
  0.6299081   0.56390977  0.60451505  0.65301003]`  
Mean Accuracy in Cross-Validation = 0.636
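
The per-fold arrays above are the kind of output `cross_val_score` returns. A minimal sketch of how such numbers could be produced, assuming a `LogisticRegression` classifier (the classifier behind the reported scores is not stated here) and a toy stand-in for the comment text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real inputs would be the 'sentence' and 'spoiler' columns.
texts = ['he dies at the end', 'great pilot episode', 'the killer is the butler',
         'nice soundtrack', 'she was dead all along', 'funny opening scene'] * 5
labels = [True, False, True, False, True, False] * 5

def cv_scores(vectorizer, folds=5):
    """Cross-validated accuracy for a vectorizer followed by a classifier."""
    # LogisticRegression is an assumed placeholder classifier.
    model = make_pipeline(vectorizer, LogisticRegression())
    return cross_val_score(model, texts, labels, cv=folds, scoring='accuracy')

baseline = cv_scores(CountVectorizer())
with_tfidf = cv_scores(TfidfVectorizer())
print(baseline.mean(), with_tfidf.mean())
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is refit inside each fold, so no information from the held-out fold leaks into training.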

## Implementing Tfidf
The term frequency $\texttt{tf(d,t)}$ measures how often term $t$ appears in document $d$. The inverse document frequency $\texttt{idf(t)}$ measures how much information the term provides: it is high for terms that are rare across all documents and low for terms that are common. The TF-IDF weight of a term in a document is the product of the two.

In [None]:
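from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer()
X_tf_train = tvec.fit_transform(text_train)  # learn vocabulary and idf weights on the training text
X_tf_test  = tvec.transform(text_test)       # reuse them on the test text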
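
To see the weighting in action, here is a small sketch on a made-up three-document corpus (the documents are illustrative, not from the data set). A term that appears in every document gets the lowest `idf_` value, and a term that appears in only one document gets the highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
docs = ['the hero dies', 'the hero wins', 'the villain escapes']

tvec = TfidfVectorizer()
X = tvec.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; idf_ holds one weight per column.
idf = {term: tvec.idf_[col] for term, col in tvec.vocabulary_.items()}
print(sorted(idf.items(), key=lambda kv: kv[1]))
```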

## Data:
### Train and test:

dfTrain.shape = (11970, 4)  
dfTest.shape = (1477, 4)

columns = `['sentence', 'spoiler', 'page', 'trope']`  

**sentence** = a sentence from the comments section  
**trope** = tropes are devices and conventions that a writer can reasonably rely on as being present in the audience members' minds and expectations  
**page** = page on which the comment was found. Should correspond to a series, movie, or show.  
**spoiler** = TRUE or FALSE. The target label for classification.  
    



In [7]:
import pandas as pd
import numpy as np
import re
from datetime import datetime
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

In [None]:
# select only certain datatypes.
sample_df.select_dtypes(include = ['float'])

In [121]:
data = pd.read_csv('2018-02-22 17_20api_response.csv')
data.info()
data.head(n=20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 679 entries, 0 to 678
Data columns (total 9 columns):
unique_unsep    679 non-null object
unique_sep      679 non-null object
Runtime         492 non-null float64
Released        505 non-null object
imdbVotes       522 non-null object
imdbRating      522 non-null float64
Genre           534 non-null object
Rated           302 non-null object
Type            540 non-null object
dtypes: float64(2), object(7)
memory usage: 47.8+ KB


Unnamed: 0,unique_unsep,unique_sep,Runtime,Released,imdbVotes,imdbRating,Genre,Rated,Type
0,AaronStone,Aaron Stone,,01 Feb 2009,704.0,7.0,"Action, Adventure, Family",TV-Y7,series
1,AbsolutelyFabulous,Absolutely Fabulous,45.0,24 Jul 1994,15581.0,8.2,Comedy,TV-14,series
2,AccordingToJim,According To Jim,30.0,03 Oct 2001,30998.0,6.4,"Comedy, Romance",TV-PG,series
3,AceOfCakes,Ace Of Cakes,21.0,17 Aug 2006,1117.0,7.2,Reality-TV,,series
4,AdventuresInWonderland,Adventures In Wonderland,30.0,,443.0,8.3,"Comedy, Family, Fantasy",,series
5,AfterLately,After Lately,,06 Mar 2011,529.0,6.6,Comedy,,series
6,AfterMASH,AfterMASH,,,,,,,
7,AfterschoolSpecial,Afterschool Special,12.0,,,,"Short, Drama",,movie
8,AgainstTheWall,Against The Wall,111.0,26 Mar 1994,1879.0,6.8,"Action, Drama, History",R,movie
9,AlarmFuerCobra11,Alarm Fuer Cobra11,,,,,,,


In [108]:
dummy = pd.get_dummies(data[['Genre','Rated','Type']], dummy_na=True)
dummy.head(n=10)

Unnamed: 0,"Genre_Action, Adventure","Genre_Action, Adventure, Comedy","Genre_Action, Adventure, Crime","Genre_Action, Adventure, Drama","Genre_Action, Adventure, Family","Genre_Action, Adventure, Fantasy","Genre_Action, Adventure, Mystery","Genre_Action, Adventure, Romance","Genre_Action, Adventure, Sci-Fi","Genre_Action, Adventure, Western",...,Rated_TV-G,Rated_TV-MA,Rated_TV-PG,Rated_TV-Y,Rated_TV-Y7,Rated_UNRATED,Rated_nan,Type_movie,Type_series,Type_nan
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1


In [107]:
print(data.shape, dummy.shape)

(679, 9) (679, 203)


In [78]:
data.Genre.split(', ')

AttributeError: 'Series' object has no attribute 'split'
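
`split` is a Python string method, so on a `Series` it has to be reached through the `.str` accessor. The sketch below, on a few `Genre` values copied from the table above, shows one way to get an indicator column per individual genre instead of one per genre combination.

```python
import pandas as pd

# A few Genre values copied from the output above (None stands for a missing value).
genre = pd.Series(['Action, Adventure, Family', 'Comedy', 'Comedy, Romance', None])

# Element-wise split through the .str accessor; missing values stay missing.
genre_lists = genre.str.split(', ')

# One 0/1 column per individual genre; rows with a missing Genre become all zeros.
genre_dummies = genre.str.get_dummies(sep=', ')
print(genre_dummies.columns.tolist())
```

Encoding individual genres keeps the column count equal to the number of distinct genres, rather than growing with every combination that happens to occur.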

In [71]:
data = data.dropna()
data.unique_sep = data.unique_sep.astype('str')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 284 entries, 1 to 678
Data columns (total 9 columns):
unique_unsep    284 non-null object
unique_sep      284 non-null object
Runtime         284 non-null float64
Released        284 non-null object
imdbVotes       284 non-null object
imdbRating      284 non-null float64
Genre           284 non-null object
Rated           284 non-null object
Type            284 non-null object
dtypes: float64(2), object(7)
memory usage: 22.2+ KB
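
Dropping every row with a missing value leaves 284 of the 679 rows. One alternative for the numeric columns is to fill the gaps instead. This is a sketch under assumptions: median imputation is only one possible strategy and is not something the analysis above settles on, and the frame below is a stand-in built from the first rows of the table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Runtime and imdbRating values copied from the first rows of the table above.
numeric = pd.DataFrame({
    'Runtime': [np.nan, 45.0, 30.0, 21.0, 30.0, np.nan],
    'imdbRating': [7.0, 8.2, 6.4, 7.2, 8.3, 6.6],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)
print(filled)
```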


In [8]:
datetime.strptime(data.Released[0], '%d %b %Y')

datetime.datetime(2009, 2, 1, 0, 0)

In [79]:
temp = pd.DataFrame(data['Released'].str.split(', '))
temp.Released[0]

['01 Feb 2009']

In [92]:
temp2 = temp.Released.values.tolist()

i=0
temp3 = [str(i) for i in temp2[i]]
print(temp3)

['01 Feb 2009']


In [84]:
type(temp2.astype('str'))

AttributeError: 'list' object has no attribute 'astype'

In [89]:
pd.DataFrame([temp3])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,669,670,671,672,673,674,675,676,677,678
0,['01 Feb 2009'],['24 Jul 1994'],['03 Oct 2001'],['17 Aug 2006'],,['06 Mar 2011'],,,['26 Mar 1994'],,...,['08 Mar 2005'],['07 Oct 2009'],,['07 Jul 2009'],['21 Sep 2006'],['04 Mar 1992'],,['15 Jun 2009'],['17 Jul 2011'],['11 Aug 2006']


In [56]:
pd.DataFrame([['01', 'Feb', '2009'],], columns = ['day','month','year'])

NameError: name 'nan' is not defined
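
The cells above try to split `Released` into parts by hand. A shorter route, sketched here on a few values copied from the data, is to let `pd.to_datetime` parse the whole column and then read the parts off the `.dt` accessor. Missing values become `NaT` rather than raising an error.

```python
import pandas as pd

# Released values copied from the table above (None stands for a missing value).
released = pd.Series(['01 Feb 2009', '24 Jul 1994', None, '26 Mar 1994'])

# Same format string as in the strptime call above; unparseable values become NaT.
parsed = pd.to_datetime(released, format='%d %b %Y', errors='coerce')

# Day, month and year as separate numeric columns (NaN where the date is missing).
parts = pd.DataFrame({
    'day': parsed.dt.day,
    'month': parsed.dt.month,
    'year': parsed.dt.year,
})
print(parts)
```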

## FunctionTransformer() and FeatureUnion()
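`FunctionTransformer` (FT) wraps a plain function in a transformer object that a scikit-learn pipeline can use. `FeatureUnion` concatenates the outputs of several transformers into one feature matrix.

1. Create an FT to select the text data and another to select the numeric data.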

In [123]:
data.head()
data['imdbVotes'] = pd.to_numeric(data['imdbVotes'].astype(str).str.replace(',',''), errors='coerce')
data.info()
data.head(n = 20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 679 entries, 0 to 678
Data columns (total 9 columns):
unique_unsep    679 non-null object
unique_sep      679 non-null object
Runtime         492 non-null float64
Released        505 non-null object
imdbVotes       522 non-null float64
imdbRating      522 non-null float64
Genre           534 non-null object
Rated           302 non-null object
Type            540 non-null object
dtypes: float64(3), object(6)
memory usage: 47.8+ KB


Unnamed: 0,unique_unsep,unique_sep,Runtime,Released,imdbVotes,imdbRating,Genre,Rated,Type
0,AaronStone,Aaron Stone,,01 Feb 2009,704.0,7.0,"Action, Adventure, Family",TV-Y7,series
1,AbsolutelyFabulous,Absolutely Fabulous,45.0,24 Jul 1994,15581.0,8.2,Comedy,TV-14,series
2,AccordingToJim,According To Jim,30.0,03 Oct 2001,30998.0,6.4,"Comedy, Romance",TV-PG,series
3,AceOfCakes,Ace Of Cakes,21.0,17 Aug 2006,1117.0,7.2,Reality-TV,,series
4,AdventuresInWonderland,Adventures In Wonderland,30.0,,443.0,8.3,"Comedy, Family, Fantasy",,series
5,AfterLately,After Lately,,06 Mar 2011,529.0,6.6,Comedy,,series
6,AfterMASH,AfterMASH,,,,,,,
7,AfterschoolSpecial,Afterschool Special,12.0,,,,"Short, Drama",,movie
8,AgainstTheWall,Against The Wall,111.0,26 Mar 1994,1879.0,6.8,"Action, Drama, History",R,movie
9,AlarmFuerCobra11,Alarm Fuer Cobra11,,,,,,,


In [None]:
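from sklearn.preprocessing import FunctionTransformer

# select only certain datatypes, e.g.:
# sample_df.select_dtypes(include = ['float'])

# pass a dataframe, return only its text (object) or numeric columns.
# pandas rejects include=['str']; text columns have dtype 'object'.
get_text_data = FunctionTransformer(lambda x: x.select_dtypes(include = ['object']), validate = False)
get_numeric_data = FunctionTransformer(lambda x: x.select_dtypes(include = [np.number]), validate = False)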
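
A sketch of how selectors like these could feed a `FeatureUnion`, assuming a tiny stand-in frame with one text column and two numeric columns. The column choice, the `TfidfVectorizer` on the text branch and the median imputation on the numeric branch are illustrative assumptions, not the pipeline used for the scores reported earlier. The text selector here returns a single column, because a vectorizer expects a one-dimensional sequence of strings rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Stand-in frame; values copied from the OMDB table above.
df = pd.DataFrame({
    'unique_sep': ['Aaron Stone', 'Absolutely Fabulous', 'According To Jim'],
    'Runtime': [np.nan, 45.0, 30.0],
    'imdbRating': [7.0, 8.2, 6.4],
})

# Text branch: pick one column as a 1-D Series of strings.
get_text_data = FunctionTransformer(lambda x: x['unique_sep'], validate=False)
# Numeric branch: keep every numeric column.
get_numeric_data = FunctionTransformer(
    lambda x: x.select_dtypes(include=[np.number]), validate=False)

union = FeatureUnion([
    ('text', Pipeline([('select', get_text_data),
                       ('tfidf', TfidfVectorizer())])),
    ('numeric', Pipeline([('select', get_numeric_data),
                          ('impute', SimpleImputer(strategy='median'))])),
])

# Columns from the text branch come first, followed by the numeric columns.
features = union.fit_transform(df)
print(features.shape)
```

Each branch sees the full DataFrame and keeps only what it needs, so the same `union` object can be dropped into a larger `Pipeline` in front of a classifier.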