 # Entity Resolution
  
Given two datasets that describe the same entities, the task is to decide, for each pair of ids in the test set (one id from each dataset), whether the two ids refer to the same entity. The datasets are movie listings from Amazon and Rotten Tomatoes and contain descriptive information such as the director, the stars and the runtime of each movie.

### Entity resolution technique:
 
To perform entity resolution, we used the following steps:

a) Importing data: the train.csv, amazon.csv, rotten_tomatoes.csv, test.csv and holdout.csv files were first imported as pandas dataframes.

b) Cleaning the Amazon and Rotten Tomatoes datasets. Preprocessing included:
- converting all string/text data to uppercase
- splitting the star column in the Amazon data into individual stars
- selecting specific columns, namely id, time, director, star1-star6 (for Rotten Tomatoes) and star1-star2 (for Amazon)
- replacing NaNs with empty strings wherever applicable
- moving runtimes that appeared in the Amazon star column (e.g. "53 minutes") into the time column and blanking the star value
- extracting hours and minutes from the time column and converting them to total minutes (as an integer)
- removing special characters and whitespace from the director and star columns


c) Merging the training data with the Amazon and Rotten Tomatoes data on id1 and id2 respectively.

d) Computing edit distance (a string-comparison measure) for directors and, column by column, for stars across the two datasets. For directors, an edit distance of 0 was given a distance score of 0, and anything else a score of 1. For stars, a less stringent criterion of edit distance <= 5 was intended to give a distance score of 0, for better generalization and to account for misspellings and truncated strings. Note that the code as written stores distances of 1 to 5 unchanged rather than mapping them to 0, so only exact star matches end up with a score of 0.

e) A new column, count_star, was created to store the total count of matched stars.

f) The absolute difference of the two runtimes was taken; values less than 10 were given a value of 0, and all others 1.

g) For the training data, we selected 3 columns: the director distance score, the time score and count_star. We blacklisted certain rows in the training dataset that appeared to be misclassified, and split the remaining data in an 80:20 ratio for training and testing.

h) We built different models using decision trees, random forests, a K-NN classifier and logistic regression (with l1 and l2 penalties) and obtained precision, recall and F1 scores of 100% on the training and test sets. We picked the decision tree classifier to predict on our test and holdout sets, as we felt it would generalize better (tested later on Instabase with predictions from all models).

i) The test.csv and holdout.csv data were then merged with the Amazon and Rotten Tomatoes data, the distance scores were computed in the same way, and the decision tree classifier was used to predict the gold values.

j) Final scores on the test file were:
- Precision: 96.25%
- Recall: 96.25%
- F1 score: 96.25%



### Important features

Director, time and star columns (score) were used as these had information common to both datasets. Director and time were the primary filters, and stricter criteria were used to match on these columns: directors had to match exactly, and the absolute runtime difference had to be less than 10 minutes. For stars, the least stringent criterion was used: even a single star matching between the two datasets counted as a match.
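
The three features described above can be sketched for a single pair as follows. This is a simplified illustration, not the notebook's code: the function name `pair_features` and the dict layout are assumptions, and stars are compared by exact equality.

```python
def pair_features(am, rt, time_tol=10):
    """Return (time, director, count_star) for one Amazon / Rotten Tomatoes pair.

    `am` and `rt` are hypothetical dicts of already-cleaned values.
    """
    # 0 when the runtimes differ by less than `time_tol` minutes, else 1
    time_score = 0 if abs(am["time"] - rt["time"]) < time_tol else 1
    # 0 only on an exact director match, else 1
    director_score = 0 if am["director"] == rt["director"] else 1
    # number of non-empty Amazon stars that also appear in the Rotten Tomatoes cast
    rt_stars = {s for s in rt["stars"] if s}
    count_star = sum(1 for s in am["stars"] if s and s in rt_stars)
    return time_score, director_score, count_star


am = {"time": 165, "director": "CHRISTOPHERNOLAN", "stars": ["CHRISTIANBALE", ""]}
rt = {"time": 165, "director": "CHRISTOPHERNOLAN",
      "stars": ["CHRISTIANBALE", "ANNEHATHAWAY", "TOMHARDY"]}
print(pair_features(am, rt))  # (0, 0, 1)
```

A pair that agrees on director and runtime and shares at least one star yields `(0, 0, n)` with `n >= 1`, which is the pattern the classifiers learn to label as a match.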

### Avoiding pairwise matching

Since the train.csv file already lists the pairs on which we train the model, we avoid pairwise matching across the full datasets. We use a decision tree classifier to learn from the training pairs and then predict on the pairs listed in the test and holdout sets. Since we do not have a comprehensive and complete movie dataset, comparing every Amazon listing with every Rotten Tomatoes listing is not practical.
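
A small sketch of the idea, using toy tables whose names and values are made up for illustration: attributes are joined onto the listed pairs only, so the number of comparisons equals the number of pairs rather than the size of the full cross product.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (illustrative values only)
amazon_toy = pd.DataFrame({"id1": [1, 2, 3], "director_am": ["A", "B", "C"]})
rt_toy = pd.DataFrame({"id2": [10, 20, 30], "director_rt": ["A", "X", "C"]})
pairs = pd.DataFrame({"id1": [1, 2], "id2": [10, 30]})

# Join attributes onto the listed pairs only
candidates = (pairs
              .merge(amazon_toy, on="id1", how="left")
              .merge(rt_toy, on="id2", how="left"))

# 2 candidate rows, versus 3 x 3 = 9 for an all-against-all comparison
print(len(candidates), len(amazon_toy) * len(rt_toy))
```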




##### Importing Amazon data

In [1]:
import pandas as pd
amazon = pd.read_csv('/Users/kaavyachinniah/Downloads/amazon.csv')
amazon

Unnamed: 0,id,time,director,star,cost
0,1,"1 hour, 33 minutes",Robert Rodriguez,"Taylor Lautner, Taylor Dooley","Rent HD $3.99,Rent SD $2.99,Buy HD $13.99,Buy ..."
1,2,4/12/2015,Michael Slovis,53 minutes,"Buy SD $2.99,Buy HD $38.99,Buy SD $28.99,"
2,3,"2 hours, 37 minutes",Anthony Russo,"Chris Evans, Samuel L. Jackson","Buy HD $19.99,Buy SD $14.99,"
3,4,"2 hours, 26 minutes",David Yates,"Bill Nighy, Emma Watson","Rent HD $3.99,Rent SD $2.99,Buy HD $9.99,Buy S..."
4,5,"2 hours, 1 minute",James Gunn,"Chris Pratt, Zoe Saldana","Buy HD $19.99,Buy SD $14.99,"
5,6,44 minutes,Richard J. Lewis,"Jim Caviezel, Kevin Chapman","Buy SD $1.99,Buy HD $34.99,Buy SD $34.99,"
6,7,"2 hours, 11 minutes",Justin Lin,"Vin Diesel, Paul Walker","Rent HD $3.99,Rent SD $2.99,Buy HD $12.99,Buy ..."
7,8,"2 hours, 21 minutes",Ron Howard,"Tom Hanks, Bill Paxton","Rent HD $3.99,Rent SD $2.99,Buy HD $9.99,Buy S..."
8,9,"1 hour, 58 minutes",Joel Coen,"Jeff Bridges, John Goodman","Rent HD $3.99,Rent SD $2.99,Buy HD $12.99,Buy ..."
9,10,"1 hour, 12 minutes",Curt Geda,"Justin Gross, Grey Griffin","Rent SD $2.99,Buy HD $9.99,Buy SD $9.99,"


##### Cleaning Amazon data

In [2]:
# converting director to upper case
amazon['director'] = amazon.director.str.upper()

# splitting star into star1 and star2 (at the first comma) and converting to upper case
amazon[['star1', 'star2']] = amazon['star'].str.split(',', n=1, expand=True)
amazon['star1'] = amazon.star1.str.upper()
amazon['star2'] = amazon.star2.str.upper()

# Picking specific columns (copy so later assignments do not write to a view of amazon)
df1 = amazon[['id','time','director','star1','star2']].copy()

# replacing NaN with blank
import numpy as np
df1 = df1.fillna('')

# some star1 values hold the runtime (e.g. "53 MINUTES"):
# move them to the time column and blank the star
idx = df1.index[df1['star1'].str.contains('MINUTES')]
df1.loc[idx, 'time'] = df1.loc[idx, 'star1']
df1.loc[idx, 'star1'] = ""
df1['time'] = df1.time.str.upper()

# Extract hours and minutes and convert to minutes
df1 = df1.replace('HOURS', 'HOUR', regex=True)
df1 = df1.replace('MINUTES', 'MINUTE', regex=True)
df1['hour'] = df1['time'].str.extract(r"(\d+) HOUR", expand=False)
df1['min'] = df1['time'].str.extract(r"(\d+) MINUTE", expand=False)

# missing hours or minutes count as 0
df1['hour'] = pd.to_numeric(df1['hour'], errors='coerce').fillna(0)
df1['min'] = pd.to_numeric(df1['min'], errors='coerce').fillna(0)
df1['tt'] = df1['hour']*60 + df1['min']
df1['time'] = df1['tt'].astype(int)
df1 = df1[['id','time','director','star1','star2']].copy()

# Removing special characters and white space in director and stars (digits are kept)
import string
df1['director'] = df1['director'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
df1['star1'] = df1['star1'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
df1['star1'] = df1['star1'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))
df1['star2'] = df1['star2'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
df1['star2'] = df1['star2'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))

am_final=df1
am_final = am_final.rename(columns={'id': 'id1'})



In [3]:
am_final

Unnamed: 0,id1,time,director,star1,star2
0,1,93,ROBERTRODRIGUEZ,TAYLORLAUTNER,TAYLORDOOLEY
1,2,53,MICHAELSLOVIS,,
2,3,157,ANTHONYRUSSO,CHRISEVANS,SAMUELLJACKSON
3,4,146,DAVIDYATES,BILLNIGHY,EMMAWATSON
4,5,121,JAMESGUNN,CHRISPRATT,ZOESALDANA
5,6,44,RICHARDJLEWIS,JIMCAVIEZEL,KEVINCHAPMAN
6,7,131,JUSTINLIN,VINDIESEL,PAULWALKER
7,8,141,RONHOWARD,TOMHANKS,BILLPAXTON
8,9,118,JOELCOEN,JEFFBRIDGES,JOHNGOODMAN
9,10,72,CURTGEDA,JUSTINGROSS,GREYGRIFFIN
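
The runtime conversion can also be written as a small standalone helper. The sketch below is an alternative formulation rather than the notebook's code; the function name `runtime_to_minutes` is an assumption, and values with no recognisable hours or minutes are mapped to 0.

```python
import re


def runtime_to_minutes(text):
    """Convert strings such as '1 hour, 33 minutes' or '44 minutes' to minutes."""
    if not isinstance(text, str):
        return 0  # missing values (e.g. NaN) count as 0 minutes
    upper = text.upper()
    hours = re.search(r"(\d+)\s*HOUR", upper)
    minutes = re.search(r"(\d+)\s*MINUTE", upper)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


print(runtime_to_minutes("1 hour, 33 minutes"))  # 93
print(runtime_to_minutes("44 minutes"))          # 44
print(runtime_to_minutes("4/12/2015"))           # 0
```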


##### Importing Rotten Tomatoes data

In [4]:
rt = pd.read_csv('/Users/kaavyachinniah/Downloads/rotten_tomatoes.csv',encoding='latin-1')
rt

Unnamed: 0,id,time,director,year,star1,star2,star3,star4,star5,star6,rotten_tomatoes,audience_rating,review1,review2,review3,review4,review5
0,1,1 hr. 45 min.,Alex March,1969,Ryan O'Neal,Leigh Taylor-Young,Van Heflin,Lee Grant,James Daly,Robert Webber,0,29,Alex March takes his sweet time getting us to ...,,,,
1,2,1 hr. 20 min.,Young Man Kang,2001,Ron Becks,Luciano Saber,Soo J. Kim,Renatta Mitchell,Hiromi Nishiyama,Jong O. Chung,0,0,High-energy film kick.,,,,
2,3,2 hr. 24 min.,Cheung Yuen-ting,1996,Maggie Cheung,Vivian Wu,Winston Chao,Hsing-kuo Wu,Jiang Wen,Elaine Jin,0,80,,,,,
3,4,1 hr. 46 min.,Andrew V. McLaglen,1968,James Stewart,Dean Martin,George Kennedy,Raquel Welch,Andrew Prine,Clint Ritchie,20,49,,,,,Buoyed by a strong cast and carried by an inte...
4,5,1 hr. 40 min.,Tony Richardson,1970,Jack Allan,Claire Balmford,Tony Bazell,Kurt Beimel,Ben Blakeney,Michael Boddy,43,28,,,,"Clunky, slow Mick Jagger starrer directed by T...","Curiously though, as it so often does, the rar..."
5,6,1 hr. 34 min.,John Mackenzie,2003,Michael Caine,Judith GodrÌ¬che,Michael Keaton,Rade Serbedzija,William Beck,Matthew Marsh,0,19,Caine admits in the DVD interviews that he did...,Even fans of either Michael will find themselv...,,,
6,7,2 hr. 12 min.,Ken Annakin,1965,James Fox,Sarah Miles,Stuart Whitman,Jean-Pierre Cassel,Alberto Sordi,Robert Morley,71,77,,Top gun of slapstick early-aviation comedies.,"Funny, zany and full of magnificent vintage ai...","Goofy ""great race"" fun.","...part slapstick, part spectacle, and part ad..."
7,8,1 hr. 28 min.,Masaaki Tezuka,2002,Yumiko Shaku,Shin Takuma,Kana Onodera,Koh Takasugi,Yusuke Tomoi,Jun'ichi Mizuno,0,69,"A serviceable movie, but overall not a very fu...",,,,
8,9,,See-Yuen Ng,1977,John Liu,Jang Lee Hwang,Chiang Wang,Corey Yuen,Yuen Kwai,Wong Cheng Li,0,50,,,,,
9,10,1 hr. 47 min.,Dante Lam,2003,Charlene Choi,Gillian Chung,Ekin Cheng,,,,0,53,The martial arts action is frequently fun to w...,,,,


##### Cleaning Rotten Tomatoes data

In [5]:
import string

rt = rt.apply(lambda x: x.astype(str).str.upper())

rt['director'] = rt['director'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star1'] = rt['star1'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star1'] = rt['star1'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))
rt['star2'] = rt['star2'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star2'] = rt['star2'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))
rt['star3'] = rt['star3'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star3'] = rt['star3'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))
rt['star4'] = rt['star4'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star4'] = rt['star4'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))
rt['star5'] = rt['star5'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star5'] = rt['star5'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))
rt['star6'] = rt['star6'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation+string.whitespace]))
rt['star6'] = rt['star6'].apply(lambda x: ''.join([i for i in x if i in string.ascii_uppercase + string.digits]))



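rt['hour'] = rt['time'].str.extract(r"(\d+) HR.", expand=False)
rt['min'] = rt['time'].str.extract(r"(\d+) MIN.", expand=False)
# missing hours or minutes count as 0
rt['hour'] = rt['hour'].fillna('0')
rt['min'] = rt['min'].fillna('0')

rt['total_time'] = rt['hour'].astype(int) * 60 + rt['min'].astype(int)

# rt.loc[rt['total_time']>=99999,'total_time'] = 0

# keep id, director, star1-star6 and total_time (column positions 0, 2, 4-9 and 19)
rt_final = rt[['id','director','star1','star2','star3','star4','star5','star6','total_time']]
# note: with regex=True this removes 'NAN' anywhere inside a value, not only whole-cell
# 'NAN' placeholders, which is why KEN ANNAKIN appears as KENAKIN in the output below
rt_final = rt_final.replace('NAN','',regex = True)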

rt_final = rt_final.rename(columns={'id': 'id2'})



In [6]:
am_final = am_final.add_suffix('_am')
am_final = am_final.rename(columns={'id1_am': 'id1'})
rt_final = rt_final.add_suffix('_rt')
rt_final = rt_final.rename(columns={'id2_rt': 'id2'})
rt_final = rt_final.rename(columns={'total_time_rt': 'time_rt'})

In [7]:
rt_final

Unnamed: 0,id2,director_rt,star1_rt,star2_rt,star3_rt,star4_rt,star5_rt,star6_rt,time_rt
0,1,ALEXMARCH,RYANONEAL,LEIGHTAYLORYOUNG,VANHEFLIN,LEEGRANT,JAMESDALY,ROBERTWEBBER,105
1,2,YOUNGMANKANG,RONBECKS,LUCIANOSABER,SOOJKIM,RENATTAMITCHELL,HIROMINISHIYAMA,JONGOCHUNG,80
2,3,CHEUNGYUENTING,MAGGIECHEUNG,VIVIANWU,WINSTONCHAO,HSINGKUOWU,JIANGWEN,ELAINEJIN,144
3,4,ANDREWVMCLAGLEN,JAMESSTEWART,DEANMARTIN,GEORGEKENNEDY,RAQUELWELCH,ANDREWPRINE,CLINTRITCHIE,106
4,5,TONYRICHARDSON,JACKALLAN,CLAIREBALMFORD,TONYBAZELL,KURTBEIMEL,BENBLAKENEY,MICHAELBODDY,100
5,6,JOHNMACKENZIE,MICHAELCAINE,JUDITHGODRCHE,MICHAELKEATON,RADESERBEDZIJA,WILLIAMBECK,MATTHEWMARSH,94
6,7,KENAKIN,JAMESFOX,SARAHMILES,STUARTWHITMAN,JEANPIERRECASSEL,ALBERTOSORDI,ROBERTMORLEY,132
7,8,MASAAKITEZUKA,YUMIKOSHAKU,SHINTAKUMA,KANAONODERA,KOHTAKASUGI,YUSUKETOMOI,JUNICHIMIZUNO,88
8,9,SEEYUENNG,JOHNLIU,JANGLEEHWANG,CHIANGWANG,COREYYUEN,YUENKWAI,WONGCHENGLI,0
9,10,DANTELAM,CHARLENECHOI,GILLIANCHUNG,EKINCHENG,,,,107
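
The name clean-up can be expressed as one helper that handles missing values before any string conversion, so no 'NAN' placeholder has to be stripped out afterwards. This is a sketch under assumptions: `normalize_name` is a hypothetical name, and missing values are assumed to arrive as `None` or a float NaN.

```python
import re


def normalize_name(value):
    """Upper-case a name and keep only A-Z and 0-9; missing values become ''."""
    if value is None or value != value:  # value != value is True only for NaN
        return ""
    return re.sub(r"[^A-Z0-9]", "", str(value).upper())


print(normalize_name("Ryan O'Neal"))         # RYANONEAL
print(normalize_name("Leigh Taylor-Young"))  # LEIGHTAYLORYOUNG
print(normalize_name("Ken Annakin"))         # KENANNAKIN
```

With pandas, such a helper could be applied column by column, for example `rt['director'].map(normalize_name)` on the raw (not yet stringified) column.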


##### Importing Training data

In [8]:
train = pd.read_csv('/Users/kaavyachinniah/Downloads/train.csv')
train = train.rename(columns={'id 2': 'id2'})
train

Unnamed: 0,id1,id2,gold
0,4,3668,0
1,7,1882,0
2,9,6068,0
3,16,2084,0
4,17,86,1
5,17,119,0
6,17,3391,0
7,25,4412,1
8,45,60,0
9,45,5424,0


##### Merging Train data with amazon and rotten tomatoes data on id1 and id2 respectively

In [9]:
mer_am=pd.merge(am_final, train, on='id1', how='inner')
mer_am[['id2']] = mer_am[['id2']].apply(pd.to_numeric)
rt_final[['id2']] = rt_final[['id2']].apply(pd.to_numeric)
mer_final=pd.merge(mer_am, rt_final, on='id2', how='inner')
mer_final['time_am']=mer_final['time_am'].astype(int)

In [10]:
mer_final

Unnamed: 0,id1,time_am,director_am,star1_am,star2_am,id2,gold,director_rt,star1_rt,star2_rt,star3_rt,star4_rt,star5_rt,star6_rt,time_rt
0,4,146,DAVIDYATES,BILLNIGHY,EMMAWATSON,3668,0,DAVIDYATES,DANIELRADCLIFFE,RUPERTGRINT,EMMAWATSON,RALPHFIENNES,MICHAELGAMBON,ALANRICKMAN,131
1,2363,103,DANIELBARBER,SEANHARRIS,,3668,0,DAVIDYATES,DANIELRADCLIFFE,RUPERTGRINT,EMMAWATSON,RALPHFIENNES,MICHAELGAMBON,ALANRICKMAN,131
2,7,131,JUSTINLIN,VINDIESEL,PAULWALKER,1882,0,JUSTINLIN,VINDIESEL,PAULWALKER,MICHELLERODRIGUEZ,JOHNORTIZ,JORDANABREWSTER,LAZALONSO,107
3,9,118,JOELCOEN,JEFFBRIDGES,JOHNGOODMAN,6068,0,JEFFMONEO,NADIALITZ,JUSTINKELLY,STEPHENMCHATTIE,JAMESLEGROS,DAVIDLAHAYE,ROSSIFSUTHERLAND,21
4,16,129,GUYRITCHIE,ROBERTDOWNEYJR,JUDELAW,2084,0,GUYRITCHIE,ROBERTDOWNEYJR,JUDELAW,RACHELMCADAMS,MARKSTRONGII,EDDIEMARSAN,ROBERTMAILLET,129
5,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,86,1,PETERJACKSON,ELIJAHWOOD,SEANASTIN,IANMCKELLEN,ANDYSERKIS,VIGGOMORTENSEN,ORLANDOBLOOM,201
6,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,119,0,PETERCHUNG,VINDIESEL,NICKCHINLUND,KEITHDAVID,RHIANAGRIFFITH,,,35
7,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,3391,0,PETERJACKSON,MARTINFREEMANII,IANMCKELLEN,RICHARDARMITAGE,IANHOLM,ELIJAHWOOD,CATEBLANCHETT,144
8,65,180,PETERJACKSON,BRUCEALLPRESS,SEANASTIN,3391,0,PETERJACKSON,MARTINFREEMANII,IANMCKELLEN,RICHARDARMITAGE,IANHOLM,ELIJAHWOOD,CATEBLANCHETT,144
9,25,165,CHRISTOPHERNOLAN,CHRISTIANBALE,,4412,1,CHRISTOPHERNOLAN,CHRISTIANBALE,ANNEHATHAWAY,TOMHARDY,MARIONCOTILLARD,JOSEPHGORDONLEVITT,MICHAELCAINE,165


##### Computing edit distance

In [15]:
df=mer_final
from nltk.metrics import edit_distance 
df["director"] = df.loc[:, ["director_am","director_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["time"] = df.apply(lambda x: (df["time_am"]-df["time_rt"]).abs() if (df["time_am"]-df["time_rt"]).abs() == 0 else 1,axis=1)
# (df["time_am"]-df["time_rt"]).abs()
# df["time"]=(df["time_am"]-df["time_rt"]).abs().mask((df["time_am"]-df["time_rt"]).abs() !=0, 1)

df["t"]=(df["time_am"]-df["time_rt"]).abs()
df["time"]=df["t"].mask(df["t"]<10, 0)
df["time"]=df["time"].mask(df["time"]>9, 1)

# df["star11"] = df.loc[:, ["star1_am","star1_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star12"] = df.loc[:, ["star1_am","star2_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star13"] = df.loc[:, ["star1_am","star3_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star14"] = df.loc[:, ["star1_am","star4_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star15"] = df.loc[:, ["star1_am","star5_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star16"] = df.loc[:, ["star1_am","star6_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star21"] = df.loc[:, ["star2_am","star1_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star22"] = df.loc[:, ["star2_am","star2_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star23"] = df.loc[:, ["star2_am","star3_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star24"] = df.loc[:, ["star2_am","star4_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star25"] = df.loc[:, ["star2_am","star5_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["star26"] = df.loc[:, ["star2_am","star6_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)

# Compare every Amazon star with every Rotten Tomatoes star (star11 ... star26).
# As written, a distance below 6 is stored unchanged (0-5) and anything larger as 1,
# so only exact matches (distance 0) are counted in count_star below.
for i in (1, 2):
    for j in range(1, 7):
        df["star{}{}".format(i, j)] = df.loc[:, ["star{}_am".format(i), "star{}_rt".format(j)]].apply(
            lambda x: edit_distance(*x) if edit_distance(*x) < 6 else 1, axis=1)


df["count_star"]=(df[["star11","star12","star13","star14","star15","star16","star21","star22","star23","star24","star25","star26"]] == 0).astype(int).sum(axis=1)

In [16]:
df

Unnamed: 0,id1,time_am,director_am,star1_am,star2_am,id2,gold,director_rt,star1_rt,star2_rt,...,star14,star15,star16,star21,star22,star23,star24,star25,star26,count_star
0,4,146,DAVIDYATES,BILLNIGHY,EMMAWATSON,3668,0,DAVIDYATES,DANIELRADCLIFFE,RUPERTGRINT,...,1,1,1,1,1,0,1,1,1,1
1,2363,103,DANIELBARBER,SEANHARRIS,,3668,0,DAVIDYATES,DANIELRADCLIFFE,RUPERTGRINT,...,1,1,1,1,1,1,1,1,1,0
2,7,131,JUSTINLIN,VINDIESEL,PAULWALKER,1882,0,JUSTINLIN,VINDIESEL,PAULWALKER,...,1,1,1,1,0,1,1,1,1,2
3,9,118,JOELCOEN,JEFFBRIDGES,JOHNGOODMAN,6068,0,JEFFMONEO,NADIALITZ,JUSTINKELLY,...,1,1,1,1,1,1,1,1,1,0
4,16,129,GUYRITCHIE,ROBERTDOWNEYJR,JUDELAW,2084,0,GUYRITCHIE,ROBERTDOWNEYJR,JUDELAW,...,1,1,1,1,0,1,1,1,1,2
5,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,86,1,PETERJACKSON,ELIJAHWOOD,SEANASTIN,...,1,1,1,1,4,1,1,1,1,0
6,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,119,0,PETERCHUNG,VINDIESEL,NICKCHINLUND,...,1,1,1,1,1,1,1,1,1,0
7,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,3391,0,PETERJACKSON,MARTINFREEMANII,IANMCKELLEN,...,1,1,1,1,1,1,1,1,1,0
8,65,180,PETERJACKSON,BRUCEALLPRESS,SEANASTIN,3391,0,PETERJACKSON,MARTINFREEMANII,IANMCKELLEN,...,1,1,1,1,1,1,1,1,1,0
9,25,165,CHRISTOPHERNOLAN,CHRISTIANBALE,,4412,1,CHRISTOPHERNOLAN,CHRISTIANBALE,ANNEHATHAWAY,...,1,1,1,1,1,1,1,1,1,1
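
For comparison, the thresholded star score described in step d) (edit distance of at most 5 maps to 0) could look like the sketch below. It is not what the cell above computes; `levenshtein` and `star_score` are illustrative names, a plain-Python edit distance stands in for `nltk.metrics.edit_distance`, and treating empty names as non-matches is an added assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def star_score(a, b, max_dist=5):
    """0 if both names are non-empty and within `max_dist` edits, else 1."""
    if not a or not b:
        return 1
    return 0 if levenshtein(a, b) <= max_dist else 1


print(levenshtein("ALIASTIN", "SEANASTIN"))        # 4
print(star_score("ALIASTIN", "SEANASTIN"))         # 0
print(star_score("BILLNIGHY", "DANIELRADCLIFFE"))  # 1
```

A fixed threshold of 5 is generous for short names, so two unrelated short names can fall within it; a threshold relative to name length is one possible refinement.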


##### Dropping rows which looked misclassified

In [17]:
df=df.drop(df.index[[4,14,84,114,115,176,215,5,17,50,80,106,117,228]])
# df=df.drop(df.index[[4,14,84,114,115,176,215]])
# df=df.drop(df.index[[5,17,50,80,106,117,228]])

In [18]:
df

Unnamed: 0,id1,time_am,director_am,star1_am,star2_am,id2,gold,director_rt,star1_rt,star2_rt,...,star14,star15,star16,star21,star22,star23,star24,star25,star26,count_star
0,4,146,DAVIDYATES,BILLNIGHY,EMMAWATSON,3668,0,DAVIDYATES,DANIELRADCLIFFE,RUPERTGRINT,...,1,1,1,1,1,0,1,1,1,1
1,2363,103,DANIELBARBER,SEANHARRIS,,3668,0,DAVIDYATES,DANIELRADCLIFFE,RUPERTGRINT,...,1,1,1,1,1,1,1,1,1,0
2,7,131,JUSTINLIN,VINDIESEL,PAULWALKER,1882,0,JUSTINLIN,VINDIESEL,PAULWALKER,...,1,1,1,1,0,1,1,1,1,2
3,9,118,JOELCOEN,JEFFBRIDGES,JOHNGOODMAN,6068,0,JEFFMONEO,NADIALITZ,JUSTINKELLY,...,1,1,1,1,1,1,1,1,1,0
6,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,119,0,PETERCHUNG,VINDIESEL,NICKCHINLUND,...,1,1,1,1,1,1,1,1,1,0
7,17,201,PETERJACKSON,NOELAPPLEBY,ALIASTIN,3391,0,PETERJACKSON,MARTINFREEMANII,IANMCKELLEN,...,1,1,1,1,1,1,1,1,1,0
8,65,180,PETERJACKSON,BRUCEALLPRESS,SEANASTIN,3391,0,PETERJACKSON,MARTINFREEMANII,IANMCKELLEN,...,1,1,1,1,1,1,1,1,1,0
9,25,165,CHRISTOPHERNOLAN,CHRISTIANBALE,,4412,1,CHRISTOPHERNOLAN,CHRISTIANBALE,ANNEHATHAWAY,...,1,1,1,1,1,1,1,1,1,1
10,45,119,STEVENSPIELBERG,HARRISONFORD,KATECAPSHAW,60,0,STEVENRAMIREZ,JOEFALOU,RAWIRIPARATENE,...,1,1,1,1,1,1,1,1,1,0
11,45,119,STEVENSPIELBERG,HARRISONFORD,KATECAPSHAW,5424,0,BENSTILLER,BENSTILLER,KRISTENWIIG,...,1,1,1,1,1,1,1,1,1,0
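
Dropping by position depends on the row order after merging. A sketch of an alternative that identifies rows by their (id1, id2) pair is shown below; the frame `labelled` is toy data, and the two blacklisted pairs are the ones at positions 4 and 5 in the output above.

```python
import pandas as pd

labelled = pd.DataFrame({"id1": [4, 16, 17, 25],
                         "id2": [3668, 2084, 86, 4412],
                         "gold": [0, 0, 1, 1]})

# (id1, id2) pairs judged to be mislabelled; the set itself is illustrative
blacklist = {(16, 2084), (17, 86)}

# Keep only rows whose (id1, id2) pair is not blacklisted
keys = zip(labelled["id1"], labelled["id2"])
keep = pd.Series([k not in blacklist for k in keys], index=labelled.index)
filtered = labelled[keep]
print(filtered[["id1", "id2"]].values.tolist())  # [[4, 3668], [25, 4412]]
```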


##### Splitting into training and testing sets

In [19]:
# feature matrix and 1-D label array (as_matrix() is no longer available in recent pandas)
X=df[['time','director',"count_star"]].values
y=df['gold'].values

In [80]:
#df.to_csv('/Users/kaavyachinniah/Downloads/finn.csv', sep='\t')

##### Scoring function

In [20]:
from sklearn.metrics import precision_score, recall_score, f1_score

def score(X,y,clf):
    """Print accuracy, precision, recall and F1 of clf on (X, y) as percentages."""
    y_pred = clf.predict(X)
    print('Score: {:.3f} %'.format(clf.score(X,y)*100))
    print('Precision: {:.3f} %'.format(precision_score(y, y_pred, average = 'binary')*100))
    print('Recall: {:.3f} %'.format(recall_score(y, y_pred)*100))
    print('F1 Score: {:.3f} %'.format(f1_score(y, y_pred)*100))
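
To make the three metrics concrete, the sketch below computes them by hand from true/false positive counts for the positive class (label 1). The function name and the toy labels are illustrative; the notebook itself relies on scikit-learn's implementations.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
# 2 true positives, 1 false positive, 1 false negative: each metric is 2/3
print(precision_recall_f1(y_true, y_pred))
```

When precision and recall are equal, F1 takes the same value, which is consistent with the three identical final scores reported above.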

### Decision Tree Classifier

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42,test_size=0.2)

In [23]:
from sklearn.metrics import precision_score,recall_score,f1_score
# X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42,test_size=0.2)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

print("Training train Set")
print("------------------")
score(X_train, y_train,clf)
print(" ")
print("Training test Set")
print("-----------------")
score(X_test, y_test,clf)

Training train Set
------------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %
 
Training test Set
-----------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %


### K Neighbours Classifier

In [24]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

print("Training train Set")
print("------------------")
score(X_train, y_train,clf)
print(" ")
print("Training test Set")
print("-----------------")
score(X_test, y_test,clf)

Training train Set
------------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %
 
Training test Set
-----------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %




### Random Forest Classifier

In [25]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
# rf.fit(X_train, y_train)
rf.fit(X, y)
print("Training train Set")
print("------------------")
score(X, y,rf)
# print(" ")
# print("Training test Set")
# print("-----------------")
# score(X_test, y_test,rf)

Training train Set
------------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %




### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression
# the l1 penalty needs a solver that supports it, such as liblinear
lg1 = LogisticRegression(penalty='l1', solver='liblinear').fit(X_train, y_train)

print("Training train Set")
print("------------------")
score(X_train, y_train,lg1)
print(" ")
print("Training test Set")
print("-----------------")
score(X_test, y_test,lg1)


Training train Set
------------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %
 
Training test Set
-----------------
Score: 100.000 %
Precision: 100.000 %
Recall: 100.000 %
F1 Score: 100.000 %




##### Importing Test set

In [28]:
test = pd.read_csv('/Users/kaavyachinniah/Downloads/test.csv')
test = test.rename(columns={'id 2': 'id2'})
test

Unnamed: 0,id1,id2
0,2817,3952
1,2817,5377
2,2832,2784
3,2840,5282
4,2840,6343
5,2852,4772
6,2862,3952
7,2862,4425
8,2867,3094
9,2891,4138


##### Merging Test data with amazon and rotten tomatoes data on id1 and id2 respectively

In [29]:
mer_am_t=pd.merge(am_final, test, on='id1', how='inner')
mer_am_t[['id2']] = mer_am_t[['id2']].apply(pd.to_numeric)
mer_final_t=pd.merge(mer_am_t, rt_final, on='id2', how='inner')
mer_final_t['time_am']=mer_final_t['time_am'].astype(int)

In [30]:
mer_final_t

Unnamed: 0,id1,time_am,director_am,star1_am,star2_am,id2,director_rt,star1_rt,star2_rt,star3_rt,star4_rt,star5_rt,star6_rt,time_rt
0,2817,158,MARTINBREST,ALPACINO,CHRISODONNELL,3952,MARIUSHOLST,STELLANSKARSGARD,KRISTOFFERJONER,BENJAMINHELSTAD,ELLENDORRITPETERSE,TRONDNILSSEN,MAGNUSLANGLETE,115
1,2862,180,MARTINSCORSESE,LEONARDODICAPRIO,JONAHHILL,3952,MARIUSHOLST,STELLANSKARSGARD,KRISTOFFERJONER,BENJAMINHELSTAD,ELLENDORRITPETERSE,TRONDNILSSEN,MAGNUSLANGLETE,115
2,3858,167,MARTINSCORSESE,LEONARDODICAPRIO,,3952,MARIUSHOLST,STELLANSKARSGARD,KRISTOFFERJONER,BENJAMINHELSTAD,ELLENDORRITPETERSE,TRONDNILSSEN,MAGNUSLANGLETE,115
3,2817,158,MARTINBREST,ALPACINO,CHRISODONNELL,5377,MARTINSCORSESE,LEONARDODICAPRIO,JONAHHILL,MARGOTROBBIE,MATTHEWMCCONAUGHEY,KYLECHANDLER,JOANNALUMLEY,179
4,2832,0,PETERJACKSON,IANMCKELLEN,,2784,PAULSAMPSON,PAULSAMPSON,UDOKIER,NORMANREEDUS,BILLYDRAGO,DAVIDCARRADINE,INGRIDSONRAY,100
5,2840,113,LEETOLANDKRIEGER,BLAKELIVELY,MICHIELHUISMAN,5282,JAKEGOLDBERGER,CUBAGOODINGJR,MALCOLMMAYS,DENNISHAYSBERT,PAULAJAIPARKER,LISAGAYHAMILTON,LISAGAYHAMILTON,100
6,2840,113,LEETOLANDKRIEGER,BLAKELIVELY,MICHIELHUISMAN,6343,LEETOLANDKRIEGER,BLAKELIVELY,MICHIELHUISMAN,HARRISONFORD,ELLENBURSTYN,KATHYBAKER,AMANDACREW,109
7,2852,109,MICHELGONDRY,JIMCARREY,KATEWINSLET,4772,MATTHEWMISHORY,PRESTONJAMESHILLIE,DANGLENN,DALILAHRAIN,EDWARDSINGLETARYJR,ERINDANIELS,ROBERTGANT,92
8,2862,180,MARTINSCORSESE,LEONARDODICAPRIO,JONAHHILL,4425,CAROLMORLEY,ZAWEASHTON,ALIXLUKACAIN,NEELAMBAKSHI,CORNELLJOHN,DARENELLIOTTHOLMES,KELLYAGBOWU,90
9,3858,167,MARTINSCORSESE,LEONARDODICAPRIO,,4425,CAROLMORLEY,ZAWEASHTON,ALIXLUKACAIN,NEELAMBAKSHI,CORNELLJOHN,DARENELLIOTTHOLMES,KELLYAGBOWU,90


##### Computing edit distance

In [31]:
dft=mer_final_t
from nltk.metrics import edit_distance 
dft["director"] = dft.loc[:, ["director_am","director_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# df["time"] = df.apply(lambda x: (df["time_am"]-df["time_rt"]).abs() if (df["time_am"]-df["time_rt"]).abs() == 0 else 1,axis=1)
# (df["time_am"]-df["time_rt"]).abs()
# dft["time"]=(dft["time_am"]-dft["time_rt"]).abs().mask((dft["time_am"]-dft["time_rt"]).abs() !=0, 1)

dft["t"]=(dft["time_am"]-dft["time_rt"]).abs()
dft["time"]=dft["t"].mask(dft["t"]<10, 0)
dft["time"]=dft["time"].mask(dft["time"]>9, 1)

# dft["star11"] = dft.loc[:, ["star1_am","star1_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star12"] = dft.loc[:, ["star1_am","star2_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star13"] = dft.loc[:, ["star1_am","star3_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star14"] = dft.loc[:, ["star1_am","star4_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star15"] = dft.loc[:, ["star1_am","star5_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star16"] = dft.loc[:, ["star1_am","star6_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star21"] = dft.loc[:, ["star2_am","star1_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star22"] = dft.loc[:, ["star2_am","star2_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star23"] = dft.loc[:, ["star2_am","star3_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star24"] = dft.loc[:, ["star2_am","star4_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star25"] = dft.loc[:, ["star2_am","star5_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)
# dft["star26"] = dft.loc[:, ["star2_am","star6_rt"]].apply(lambda x: edit_distance(*x) if edit_distance(*x)==0 else 1, axis=1)

# Same star comparison as for the training data (star11 ... star26).
# A distance below 6 is stored unchanged (0-5) and anything larger as 1,
# so only exact matches (distance 0) are counted in count_star below.
for i in (1, 2):
    for j in range(1, 7):
        dft["star{}{}".format(i, j)] = dft.loc[:, ["star{}_am".format(i), "star{}_rt".format(j)]].apply(
            lambda x: edit_distance(*x) if edit_distance(*x) < 6 else 1, axis=1)


dft["count_star"]=(dft[["star11","star12","star13","star14","star15","star16","star21","star22","star23","star24","star25","star26"]] == 0).astype(int).sum(axis=1)


In [33]:
# feature matrix for the test pairs (no gold labels are available here)
Xt=dft[['time','director',"count_star"]].values

In [57]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Yt=clf.predict(Xt)

In [58]:
# copy so that adding the gold column below does not write to a slice of dft
t1=dft[["id1","id2"]].copy()

In [59]:
t1["gold"]=Yt


In [60]:
t1

Unnamed: 0,id1,id2,gold
0,2817,3952,0
1,2862,3952,0
2,3858,3952,0
3,2817,5377,0
4,2832,2784,0
5,2840,5282,0
6,2840,6343,1
7,2852,4772,0
8,2862,4425,0
9,3858,4425,0


In [61]:
mer_test=pd.merge(test, t1, on=['id1', 'id2'], how='inner')

In [62]:
mer_test

Unnamed: 0,id1,id2,gold
0,2817,3952,0
1,2817,5377,0
2,2832,2784,0
3,2840,5282,0
4,2840,6343,1
5,2852,4772,0
6,2862,3952,0
7,2862,4425,0
8,2867,3094,1
9,2891,4138,1


In [63]:
mer_test.to_csv('/Users/kaavyachinniah/Downloads/mer_test.csv', sep=',')

In [41]:
mer_final_t.to_csv('/Users/kaavyachinniah/Downloads/mer_final_t.csv', sep=',')

##### Importing the holdout set

In [42]:
holdout = pd.read_csv('/Users/kaavyachinniah/Downloads/holdout.csv')
holdout = holdout.rename(columns={'id 2': 'id2'})
holdout

Unnamed: 0,id1,id2
0,3967,759
1,3974,964
2,4016,2816
3,4027,1589
4,4032,4317
5,4039,661
6,4077,5743
7,4094,399
8,4094,2364
9,4124,4438


##### Merging holdout data with Amazon and Rotten Tomatoes data on id1 and id2 respectively

In [43]:
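# Attach the Amazon attributes to each holdout pair via id1
mer_am_h=pd.merge(am_final, holdout, on='id1', how='inner')
# Make both id2 columns numeric so the join keys compare equal
mer_am_h['id2'] = pd.to_numeric(mer_am_h['id2'])
rt_final['id2'] = pd.to_numeric(rt_final['id2'])
# Attach the Rotten Tomatoes attributes via id2
mer_final_h=pd.merge(mer_am_h, rt_final, on='id2', how='inner')
mer_final_h['time_am']=mer_final_h['time_am'].astype(int)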

In [44]:
mer_final_h

Unnamed: 0,id1,time_am,director_am,star1_am,star2_am,id2,director_rt,star1_rt,star2_rt,star3_rt,star4_rt,star5_rt,star6_rt,time_rt
0,11,99,MATTHEWHOPE,TOBYKEBBELL,ADIBIELSKI,2561,MATTHEWHOPE,TOBYKEBBELL,BRIANCOX,ADIBIELSKI,TOMBROOKE,TONYCURRAN,ASHLEYTHOMAS,100
1,43,119,GEORGECLOONEY,GEORGECLOONEY,MATTDAMON,3153,GEORGECLOONEY,GEORGECLOONEY,MATTDAMON,BILLMURRAY,JOHNGOODMAN,JEANDUJARDIN,BOBBALABAN,118
2,72,93,KRISTIANLEVRING,MADSMIKKELSEN,EVAGREEN,6288,KRISTIANLEVRING,MADSMIKKELSEN,EVAGREEN,JEFFREYDEANMORGAN,MIKAELPERSBRANDT,MICHAELRAYMONDJAME,ALEXANDERARNOLD,100
3,3967,122,SAMMENDES,KEVINSPACEY,ANNETTEBENING,759,SAMFIRSTENBERG,DAVIDBRADLEY,MARKDACASCOS,VALARIETRAPP,JOHNFUJIOKA,VALERIETRAPP,MELISSAHELLMAN,89
4,3974,82,CHARLESMAXWELL,COLINANDREWS,FRANCINEBLAKE,964,CHARLESTKANGANIS,SAMJJONES,HARRYGUARDINO,JOEYHOUSE,ABEVIGODA,BUBBASMITH,NICHOLASWORTH,90
5,5095,278,CHARLESJARROTT,RICHARDBURTON,,964,CHARLESTKANGANIS,SAMJJONES,HARRYGUARDINO,JOEYHOUSE,ABEVIGODA,BUBBASMITH,NICHOLASWORTH,90
6,4016,125,BILLCONDON,TAYLORLAUTNER,GILBIRMINGHAM,2816,BILLCONDON,KRISTENSTEWART,ROBERTPATTINSON,TAYLORLAUTNER,PETERFACINELLI,ELIZABETHREASER,ASHLEYGREENE,115
7,4027,77,RAOULMARTINEZ,MICHAELALBERT,STANLEYARONOWITZ,1589,GARYHARTLE,TOMKANE,FREDTATASCIORE,AIDANDRUMMOND,NOAHCRAWFORD,DEMPSEYPAPPION,BRENNAOBRIEN,84
8,4032,125,RODNEYRAY,SAMUELDAVIS,PERRYFROST,4317,RODNEYRAY,BRIANSAMUELDAVIS,BENDAVIES,PERRYFROST,JULIEKENDALL,WILLSCHWAB,DODIEBROWN,120
9,4039,117,MICHAELMANN,DANIELDAYLEWIS,,661,ANTHONYMANN,VICTORMATURE,ANNEBANCROFT,ROBERTPRESTON,GUYMADISON,JAMESWHITMORE,RUSSELLCOLLINS,98
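
Because both merges are inner joins, a holdout pair silently disappears if either id is missing from the corresponding source. The sketch below uses toy frames with hypothetical values (only the column names mirror the notebook's) to show one way of spotting such pairs with a left merge and `indicator=True`. It is a sanity check under those assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy frames with illustrative values; column names mirror the notebook's.
holdout_toy = pd.DataFrame({"id1": [1, 2, 3], "id2": [10, 20, 30]})
am_toy = pd.DataFrame({"id1": [1, 2], "director_am": ["A", "B"]})
rt_toy = pd.DataFrame({"id2": [10, 30], "director_rt": ["A", "C"]})

# Two inner joins: only pairs found in both sources survive.
merged = (holdout_toy.merge(am_toy, on="id1", how="inner")
                     .merge(rt_toy, on="id2", how="inner"))

# A left join with indicator=True flags pairs that have no Amazon record.
check = holdout_toy.merge(am_toy, on="id1", how="left", indicator=True)
missing_am = check.loc[check["_merge"] == "left_only", ["id1", "id2"]]
```

Here only the pair (1, 10) survives both joins, and `missing_am` lists the pair whose `id1` has no Amazon row. The same check can be repeated against the Rotten Tomatoes frame on `id2`.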


##### Computing edit distance

In [45]:
dfh=mer_final_h
from nltk.metrics import edit_distance 
# Director score: 0 for an exact match (edit distance 0), otherwise 1
dfh["director"] = dfh.loc[:, ["director_am","director_rt"]].apply(lambda x: 0 if edit_distance(*x)==0 else 1, axis=1)

# Runtime score: 0 if the runtimes differ by less than 10 minutes, otherwise 1
dfh["t"]=(dfh["time_am"]-dfh["time_rt"]).abs()
dfh["time"]=dfh["t"].mask(dfh["t"]<10, 0)
dfh["time"]=dfh["time"].mask(dfh["time"]>9, 1)

# Compare every Amazon star with every Rotten Tomatoes star.
# A distance below 6 is kept as-is and anything larger is set to 1,
# so only exact matches (distance 0) are counted in count_star.
def star_score(pair):
    d = edit_distance(*pair)
    return d if d<6 else 1

star_cols=[]
for i in (1,2):
    for j in range(1,7):
        col="star{}{}".format(i,j)
        dfh[col] = dfh.loc[:, ["star{}_am".format(i),"star{}_rt".format(j)]].apply(star_score, axis=1)
        star_cols.append(col)

dfh["count_star"]=(dfh[star_cols] == 0).astype(int).sum(axis=1)



In [46]:
dfh

Unnamed: 0,id1,time_am,director_am,star1_am,star2_am,id2,director_rt,star1_rt,star2_rt,star3_rt,...,star14,star15,star16,star21,star22,star23,star24,star25,star26,count_star
0,11,99,MATTHEWHOPE,TOBYKEBBELL,ADIBIELSKI,2561,MATTHEWHOPE,TOBYKEBBELL,BRIANCOX,ADIBIELSKI,...,1,1,1,1,1,0,1,1,1,2
1,43,119,GEORGECLOONEY,GEORGECLOONEY,MATTDAMON,3153,GEORGECLOONEY,GEORGECLOONEY,MATTDAMON,BILLMURRAY,...,1,1,1,1,0,1,1,1,1,2
2,72,93,KRISTIANLEVRING,MADSMIKKELSEN,EVAGREEN,6288,KRISTIANLEVRING,MADSMIKKELSEN,EVAGREEN,JEFFREYDEANMORGAN,...,1,1,1,1,0,1,1,1,1,2
3,3967,122,SAMMENDES,KEVINSPACEY,ANNETTEBENING,759,SAMFIRSTENBERG,DAVIDBRADLEY,MARKDACASCOS,VALARIETRAPP,...,1,1,1,1,1,1,1,1,1,0
4,3974,82,CHARLESMAXWELL,COLINANDREWS,FRANCINEBLAKE,964,CHARLESTKANGANIS,SAMJJONES,HARRYGUARDINO,JOEYHOUSE,...,1,1,1,1,1,1,1,1,1,0
5,5095,278,CHARLESJARROTT,RICHARDBURTON,,964,CHARLESTKANGANIS,SAMJJONES,HARRYGUARDINO,JOEYHOUSE,...,1,1,1,1,1,1,1,1,1,0
6,4016,125,BILLCONDON,TAYLORLAUTNER,GILBIRMINGHAM,2816,BILLCONDON,KRISTENSTEWART,ROBERTPATTINSON,TAYLORLAUTNER,...,1,1,1,1,1,1,1,1,1,1
7,4027,77,RAOULMARTINEZ,MICHAELALBERT,STANLEYARONOWITZ,1589,GARYHARTLE,TOMKANE,FREDTATASCIORE,AIDANDRUMMOND,...,1,1,1,1,1,1,1,1,1,0
8,4032,125,RODNEYRAY,SAMUELDAVIS,PERRYFROST,4317,RODNEYRAY,BRIANSAMUELDAVIS,BENDAVIES,PERRYFROST,...,1,1,1,1,1,0,1,1,1,1
9,4039,117,MICHAELMANN,DANIELDAYLEWIS,,661,ANTHONYMANN,VICTORMATURE,ANNEBANCROFT,ROBERTPRESTON,...,1,1,1,1,1,1,1,1,1,0
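
The three features fed to the classifier (`time`, `director`, `count_star`) can be traced by hand for a single pair. The sketch below is a pure-Python illustration: `levenshtein` and `pair_features` are hypothetical helpers written for this example (the notebook itself uses `nltk.metrics.edit_distance` on DataFrame columns), and the inputs are rows 0 and 3 of `mer_final_h`.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pair_features(am, rt):
    """Return (time, director, count_star) for one Amazon/Rotten Tomatoes pair."""
    # 0 if the runtimes differ by less than 10 minutes, otherwise 1
    time = 0 if abs(am["time"] - rt["time"]) < 10 else 1
    # 0 only for an exact director match
    director = 0 if levenshtein(am["director"], rt["director"]) == 0 else 1
    # number of (Amazon star, Rotten Tomatoes star) pairs at distance 0
    count_star = sum(1 for a in am["stars"] for r in rt["stars"]
                     if levenshtein(a, r) == 0)
    return time, director, count_star


# Row 0 of mer_final_h (id1 11, id2 2561)
am0 = {"time": 99, "director": "MATTHEWHOPE",
       "stars": ["TOBYKEBBELL", "ADIBIELSKI"]}
rt0 = {"time": 100, "director": "MATTHEWHOPE",
       "stars": ["TOBYKEBBELL", "BRIANCOX", "ADIBIELSKI",
                 "TOMBROOKE", "TONYCURRAN", "ASHLEYTHOMAS"]}

# Row 3 of mer_final_h (id1 3967, id2 759)
am3 = {"time": 122, "director": "SAMMENDES",
       "stars": ["KEVINSPACEY", "ANNETTEBENING"]}
rt3 = {"time": 89, "director": "SAMFIRSTENBERG",
       "stars": ["DAVIDBRADLEY", "MARKDACASCOS", "VALARIETRAPP",
                 "JOHNFUJIOKA", "VALERIETRAPP", "MELISSAHELLMAN"]}

features0 = pair_features(am0, rt0)
features3 = pair_features(am3, rt3)
```

Row 0 yields `(0, 0, 2)` and row 3 yields `(1, 1, 0)`; the star counts agree with the `count_star` column shown for those rows.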


In [47]:
Xh=pd.DataFrame(dfh[['time','director',"count_star"]])
# yt=pd.DataFrame(dft['gold'])
Xh=Xh.values  # NumPy array of the three features (as_matrix() is no longer available in current pandas)
# yt=yt.as_matrix()

In [64]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Yh=clf.predict(Xh)

In [65]:
h1=dfh[["id1","id2"]].copy()

In [66]:
h1["gold"]=Yh

In [67]:
mer_hold=pd.merge(holdout, h1, on=['id1', 'id2'], how='inner')

In [68]:
mer_hold.to_csv('/Users/kaavyachinniah/Downloads/mer_hold.csv', sep=',')

In [69]:
mer_final_h.to_csv('/Users/kaavyachinniah/Downloads/mer_final_h.csv', sep=',')
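
`to_csv` writes the row index as an extra leading column by default. If the output file is meant to contain only `id1`, `id2` and `gold` (an assumption about the expected format, which the notebook does not state), passing `index=False` is one option. The sketch writes to in-memory buffers with two illustrative rows instead of the notebook's paths.

```python
import io
import pandas as pd

# Two illustrative rows; not the notebook's full predictions.
preds = pd.DataFrame({"id1": [2817, 2840], "id2": [3952, 6343], "gold": [0, 1]})

with_index = io.StringIO()
preds.to_csv(with_index)             # default: index written as the first column

no_index = io.StringIO()
preds.to_csv(no_index, index=False)  # only the three data columns

header_with_index = with_index.getvalue().splitlines()[0]
header_no_index = no_index.getvalue().splitlines()[0]
```

The first header line starts with an empty field for the index, while the second contains just the three column names.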