# MADE HW-2
## What genres has a movie with a provided dialogue?
[Kaggle competition](https://www.kaggle.com/c/made-hw-2/data)


### Outline <a name = 'outline'></a>
* [Data reading](#data) 
* Next

In [94]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import pipeline, model_selection, preprocessing, metrics

### Data Reading <a name = 'data'></a>

In [66]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [96]:
test.head()

Unnamed: 0,id,dialogue
0,0,Boy! Did you see the way Mama whopped that dep...
1,1,"Gordon, the insurance people are balking on th..."
2,2,Very fancy. Did you design the bottle? <BR> W...
3,3,It makes me so mad. Steven Schwimmer ready to ...
4,4,Something ought to loosen him up ... how comes...


In [68]:
train = train.drop(['id', 'movie'], axis=1)

In [84]:
train['genre_list'] = train['genres'].map(lambda x : x[1:-1].split(', '))

In [71]:
genres = set()
for x in train['genre_list']:
    genres |= set(x)

In [86]:
train_full = train.drop('genre_list', 1).join(train.genre_list.str.join('|').str.get_dummies())

RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb

In [129]:
#create 2-step pipeline: vectorizer and reclassifier
pipe = pipeline.Pipeline(steps = [('vectorizer', CountVectorizer(min_df = 100, stop_words={'english'})), 
                                  ('classifier', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1))])

In [89]:
train_train, train_test = model_selection.train_test_split(train_full, random_state=42, test_size=0.33, shuffle=True)

In [130]:
for g in genres:
    print('... Processing {}'.format(g))
    # train the model using X_dtm & y
    pipe.fit(train_train['dialogue'], train_train[g])
    # compute the testing accuracy
    prediction = pipe.predict(train_test['dialogue'])
    print('Test accuracy is {}'.format(metrics.accuracy_score(train_test[g], prediction)))

... Processing u'horror'




Test accuracy is 0.9672346002621232
... Processing u'comedy'




Test accuracy is 0.8486238532110092
... Processing u'mystery'




Test accuracy is 0.9284895150720839
... Processing u'adventure'




Test accuracy is 0.9291448230668414
... Processing u'family'




Test accuracy is 0.9961500655307994
... Processing u'history'




Test accuracy is 0.9945117955439057
... Processing u'fantasy'




Test accuracy is 0.9751802096985583
... Processing u'war'




Test accuracy is 0.9862385321100917
... Processing u'sci-fi'




Test accuracy is 0.9252948885976409
... Processing u'sport'




Test accuracy is 0.9952490170380078
... Processing u'western'




Test accuracy is 0.9956585845347313
... Processing u'animation'




Test accuracy is 0.9986074705111402
... Processing u'biography'




Test accuracy is 0.9770642201834863
... Processing u'thriller'




Test accuracy is 0.7805537352555701
... Processing u'action'




Test accuracy is 0.8529652686762779
... Processing u'music'




Test accuracy is 0.9920543905635649
... Processing u'musical'




Test accuracy is 0.9988532110091743
... Processing u'crime'




Test accuracy is 0.8618119266055045
... Processing u'romance'




Test accuracy is 0.8592726081258192
... Processing u'drama'




Test accuracy is 0.7848951507208388


In [133]:
result = test.copy()
for g in genres:
    print('... Processing {}'.format(g))
    # train the model using X_dtm & y
    pipe.fit(train_full['dialogue'], train_full[g])
    # compute the testing accuracy
    prediction = np.array([x[1] for x in pipe.predict_proba(test['dialogue'])])
    result[g] = prediction.tolist()

... Processing u'horror'




... Processing u'comedy'




... Processing u'mystery'




... Processing u'adventure'




... Processing u'family'




... Processing u'history'




... Processing u'fantasy'




... Processing u'war'




... Processing u'sci-fi'




... Processing u'sport'




... Processing u'western'




... Processing u'animation'




... Processing u'biography'




... Processing u'thriller'




... Processing u'action'




... Processing u'music'




... Processing u'musical'




... Processing u'crime'




... Processing u'romance'




... Processing u'drama'




In [134]:
result.head()

Unnamed: 0,id,dialogue,u'horror',u'comedy',u'mystery',u'adventure',u'family',u'history',u'fantasy',u'war',...,u'western',u'animation',u'biography',u'thriller',u'action',u'music',u'musical',u'crime',u'romance',u'drama'
0,0,Boy! Did you see the way Mama whopped that dep...,0.084586,0.064037,0.081484,0.013718,0.0102544,0.007888,0.026657,0.015082,...,0.036913,0.001768394,0.010669,0.489727,0.088973,0.014999,0.00324152,0.745424,0.247159,0.524
1,1,"Gordon, the insurance people are balking on th...",0.025715,0.132819,0.104734,0.084721,0.002430368,0.018534,0.01767,0.043876,...,0.006179,0.000626788,0.113516,0.251663,0.15945,0.008384,0.0008646914,0.103948,0.105419,0.912407
2,2,Very fancy. Did you design the bottle? <BR> W...,0.000356,0.070334,0.009895,0.085341,6.979026e-06,9e-06,0.001123,0.003821,...,1.9e-05,7.751727e-08,0.001168,0.035053,0.119129,3e-05,2.100333e-08,0.013524,0.012336,0.886206
3,3,It makes me so mad. Steven Schwimmer ready to ...,0.007306,0.080126,0.003609,0.063049,3.098677e-07,2e-06,0.008068,0.002707,...,6e-06,5.2318e-08,0.005598,0.028033,0.153437,0.000167,1.208801e-07,0.052534,0.162434,0.906562
4,4,Something ought to loosen him up ... how comes...,0.20035,0.367231,0.272128,0.088615,0.004623182,0.030383,0.032167,0.03593,...,0.025432,0.006648261,0.031456,0.45886,0.469866,0.012826,0.01227411,0.213798,0.264209,0.235865


In [135]:
result['answer'] = result.drop(['id', 'dialogue'], axis=1).idxmax(axis=1)

In [136]:
result.head()

Unnamed: 0,id,dialogue,u'horror',u'comedy',u'mystery',u'adventure',u'family',u'history',u'fantasy',u'war',...,u'animation',u'biography',u'thriller',u'action',u'music',u'musical',u'crime',u'romance',u'drama',answer
0,0,Boy! Did you see the way Mama whopped that dep...,0.084586,0.064037,0.081484,0.013718,0.0102544,0.007888,0.026657,0.015082,...,0.001768394,0.010669,0.489727,0.088973,0.014999,0.00324152,0.745424,0.247159,0.524,u'crime'
1,1,"Gordon, the insurance people are balking on th...",0.025715,0.132819,0.104734,0.084721,0.002430368,0.018534,0.01767,0.043876,...,0.000626788,0.113516,0.251663,0.15945,0.008384,0.0008646914,0.103948,0.105419,0.912407,u'drama'
2,2,Very fancy. Did you design the bottle? <BR> W...,0.000356,0.070334,0.009895,0.085341,6.979026e-06,9e-06,0.001123,0.003821,...,7.751727e-08,0.001168,0.035053,0.119129,3e-05,2.100333e-08,0.013524,0.012336,0.886206,u'drama'
3,3,It makes me so mad. Steven Schwimmer ready to ...,0.007306,0.080126,0.003609,0.063049,3.098677e-07,2e-06,0.008068,0.002707,...,5.2318e-08,0.005598,0.028033,0.153437,0.000167,1.208801e-07,0.052534,0.162434,0.906562,u'drama'
4,4,Something ought to loosen him up ... how comes...,0.20035,0.367231,0.272128,0.088615,0.004623182,0.030383,0.032167,0.03593,...,0.006648261,0.031456,0.45886,0.469866,0.012826,0.01227411,0.213798,0.264209,0.235865,u'action'


In [137]:
result['answer'] = result['answer'].map(lambda x : x[2:-1])

In [138]:
result_result = result[['id','answer']]

In [139]:
result_result = result_result.rename(columns={"answer": "genres"})

In [140]:
result_result.head()

Unnamed: 0,id,genres
0,0,crime
1,1,drama
2,2,drama
3,3,drama
4,4,action


In [141]:
result_result.to_csv ('result1.csv', index = False, header=True)
