# Machine Learning

## Recommend

These systems work primarily in two ways.
- Analyze customers' behavior and look for similarities among them.
    - Ex. Cust1 likes (A,B). Cust2 likes (A,C). Cust1 may enjoy 'C'.
- Look at items that share characteristics or are otherwise associated.
    - Ex. Cust1 likes A. A has (a,b,c). C has (a,b,d). Cust1 may enjoy 'C'.
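
The first approach can be sketched with the toy data from the example above. This is only an illustration: the `likes` dictionary, the Jaccard similarity measure, and the helper names `jaccard` and `suggest` are assumptions, not part of the original notebook.

```python
# Toy "likes" data matching the example above; names are illustrative only.
likes = {
    'Cust1': {'A', 'B'},
    'Cust2': {'A', 'C'},
    'Cust3': {'D'},
}

def jaccard(a, b):
    """Share of liked items two customers have in common (0.0 to 1.0)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))

def suggest(customer, likes):
    """Score unseen items by the similarity of the customers who like them."""
    scores = {}
    for other, items in likes.items():
        if other == customer:
            continue
        sim = jaccard(likes[customer], items)
        if sim == 0.0:
            continue  # nothing in common, so no evidence either way
        for item in items - likes[customer]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))

suggestions = suggest('Cust1', likes)
```

Cust1 and Cust2 share 'A', so 'C' is the only item suggested to Cust1; Cust3 has nothing in common with anyone and contributes nothing.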
    
Two interesting questions arise: how do we know what customers like, and which characteristics matter when creating associations?

- How to know what customers like?
    - Primarily by votes (likes), ratings (4 stars) or browsed items.
- What characteristics are important to create associations? 
    - That requires more subject-matter expertise. Smurfs are blue and I like Smurfs, but I may not enjoy blue scarves. Fortunately, since the comparison here is only between texts, we stay within a single domain.
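
One possible way to compare texts is cosine similarity over word counts. The notebook does not say how the scores stored in `recs.json` were computed, so the measure below is an assumption, and the `docs` texts and the helpers `bag_of_words`, `cosine`, and `rank_similar` are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny made-up "scripts"; the real project compares full script texts.
docs = {
    'A': 'the wedding in london was a second chance at love',
    'B': 'aliens attack the city and the army fights back',
    'C': 'a chance meeting in london leads to love',
}

def bag_of_words(text):
    """Count lowercase word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(c1, c2):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def rank_similar(title, docs):
    """Return [other_title, score] pairs, most similar first."""
    bags = {name: bag_of_words(text) for name, text in docs.items()}
    scored = [[other, cosine(bags[title], bags[other])]
              for other in docs if other != title]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_similar('A', docs)
```

With these texts, 'C' ranks above 'B' for 'A' because it shares more words with it. The result has the same shape as the lists returned further down by `recommender`: pairs of title and score, best match first.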

In [85]:
import json

with open('data/recs.json') as data_file:    
    recs = json.load(data_file)

Since exact keys can be hard to find (they include extra characters such as the year and the file extension), define **'levDist'** and **'levSearch'** to look them up. They compute the Levenshtein distance and return the closest key (minimum edit distance).

In [200]:
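def levDist(s1,s2):
    # Make s1 the shorter string so each row stays as small as possible
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    # distances[i] = edit distance between s1[:i] and the part of s2 seen so far
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                # 1 + cheapest neighbour: substitute, or insert/delete a character
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]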

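def levSearch(s1,dic):
    # Return the key of dic closest to s1 by edit distance (None if dic is empty)
    result = None
    count  = float('inf')
    # Normalize the query once: lowercase and collapse repeated whitespace
    s1n = " ".join(s1.lower().split())
    for key in dic.keys():
        # Drop the trailing ' YYYY.txt' (9 characters) before comparing
        keyn = " ".join(key[:-9].lower().split())
        temp = levDist(s1n,keyn)
        if  temp < count:
            count  = temp
            result = key
    return result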
                

In [201]:
levSearch('cast      dhance      varvey',recs)

u'Last Chance Harvey 2008.txt'
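
To see why the noisy query resolves to that key, the normalization and distance steps can be traced by hand. The sketch below is self-contained and uses its own helper names (`normalize`, `edit_distance`), which are assumptions for illustration; it follows the same rolling-row idea as `levDist` but is not the notebook's code.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace, as levSearch does."""
    return " ".join(text.lower().split())

def edit_distance(s1, s2):
    """Levenshtein distance computed with a single rolling row."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

key = 'Last Chance Harvey 2008.txt'
query = 'cast      dhance      varvey'

stripped = normalize(key[:-9])   # drop the trailing ' 2008.txt' (9 characters)
cleaned = normalize(query)       # 'cast dhance varvey'
distance = edit_distance(cleaned, stripped)
```

Here `cleaned` differs from `stripped` by three substitutions (one per word), so `distance` is 3, which is small enough to beat unrelated titles.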

Then define the **'recommender'** function. It takes the name of a script and returns the closest, i.e. most similar, scripts.

In [249]:
def recommender(string,dic,n=3):
    s1 = levSearch(string,dic)
    #print s1
    return dic[s1][:n]

In [206]:
recommender('SFbeat      dhance      varvey',recs)

Last Chance Harvey 2008.txt


[[u'Jack and Jill 2011.txt', 0.9963078564004061],
 [u'Barefoot 2014.txt', 0.9948440050605697],
 [u'Ricki and the Flash 2015.txt', 0.9930041706830336]]

## Summarize 

To learn more about a script, it helps to glimpse its contents and basic metadata. Define the **'FrequencySummarizer'** class, whose **'summarize'** method prints a short extractive summary along with additional information about a script.

In [278]:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
    def __init__(self, min_cut=0.1, max_cut=0.9):
        self._min_cut = min_cut
        self._max_cut = max_cut 
        self._stopwords = set(stopwords.words('english') + list(punctuation))

    def _compute_frequencies(self,word_sent):
        freq = defaultdict(int)
        for s in word_sent:
            for word in s:
                if word not in self._stopwords:
                    freq[word] += 1
        m = float(max(freq.values()))
        # Iterate over a copy of the keys so entries can be deleted safely
        for w in list(freq.keys()):
            freq[w] = freq[w]/m
            if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
                del freq[w]
        return freq

    def summarize(self,s1,n=5):
        s2 = levSearch(s1,recs)
        text = open('scrapped/'+s2,'r').readlines()[2]
        sents = sent_tokenize(text)
        assert n <= len(sents)
        word_sent = [word_tokenize(s.lower()) for s in sents]
        self._freq = self._compute_frequencies(word_sent)
        ranking = defaultdict(int)
        for i,sent in enumerate(word_sent):
            for w in sent:
                if w in self._freq:
                    ranking[i] += self._freq[w]
        sents_idx = self._rank(ranking, n)  
        print 'Title: ', df['Title'].ix[s2],
        print '\nRated: ', df['Rated'].ix[s2],
        print '\nRuntime: ', df['Runtime'].ix[s2],
        print '\nDirector: ', df['Director'].ix[s2],
        print '\nActors: ', df['Actors1'].ix[s2]+', '+df['Actors2'].ix[s2],
        print '\nGenres: ', df['Genre1'].ix[s2]+', '+df['Genre2'].ix[s2],
        print '\nRating: ', df['imdbRating'].ix[s2]
        print '\nPoster: ', df['Poster'].ix[s2]
        print '\nSummary:'
        temp = [sents[j] for j in sents_idx]
        for i in temp:
            print i
        print '\n'
        print 'Similar Titles:'
        temp = recommender(s1,recs,5)
        for i in temp:
            print i[0][:-8]
        print '\n'

    def _rank(self, ranking, n):
        return nlargest(n, ranking, key=ranking.get)

In [280]:
fs = FrequencySummarizer()
fs.summarize('beat     spay    blove',5)

Title:  Eat Pray Love 
Rated:  PG-13 
Runtime:  133.0 
Director:  Ryan Murphy 
Actors:  Julia Roberts,  I. Gusti Ayu Puspawati 
Genres:  Drama,  Romance 
Rating:  5.7

Poster:  https://images-na.ssl-images-amazon.com/images/M/MV5BMTY5NDkyNzkyM15BMl5BanBnXkFtZTcwNDQyNDk0Mw@@._V1_SX300.jpg

Summary:
I'm going to Italy and then l'm going to David's guru's ashram in lndia... ...and l'm going to end the year in Bali.
It's long, it's tedious, I can't keep up... ...and l get these insane anxieties about everything in my life... ...and l've lost my place.
And it was such a foreign concept to me, that l swear l almost began with: "l'm a big fan of your work."
-No, I don't even have my-- l-- You don't have your-- You don't-- You're so naked.
And l-- You know, I don't-- I don't know.


Similar Titles:
Me Again 
In The French Style 
Under the Greenwood Tree 
Far from the Madding Crowd 
Dames 
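
The scoring idea behind the summary can be shown without NLTK. The sketch below is a simplified stand-in, not the class above: the regex-based sentence and word splitting, the short `STOPWORDS` set, the function name `summarize_text`, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# A tiny illustrative stop-word list; the class above uses NLTK's English list.
STOPWORDS = {'the', 'a', 'an', 'and', 'is', 'it', 'to', 'of', 'in', 'on', 'are'}

def summarize_text(text, n=1, min_cut=0.1, max_cut=0.9):
    """Return the n sentences whose words have the highest summed frequency."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sents = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    words = [re.findall(r"[a-z']+", s.lower()) for s in sents]
    counts = Counter(w for sent in words for w in sent if w not in STOPWORDS)
    if not counts:
        return []
    top = float(max(counts.values()))
    # Normalize to 0..1 and keep only words strictly between the two cut-offs.
    freq = {w: c / top for w, c in counts.items() if min_cut < c / top < max_cut}
    scores = {i: sum(freq.get(w, 0.0) for w in sent) for i, sent in enumerate(words)}
    best = nlargest(min(n, len(sents)), scores, key=scores.get)
    return [sents[i] for i in best]

sample = ("Cats chase mice. Cats and dogs chase balls. Dogs sleep. "
          "Cats sleep on the sofa. Mice eat cheese.")
summary = summarize_text(sample, n=1)
```

Note that with the default `max_cut` of 0.9 the single most frequent word (normalized frequency 1.0, here "cats") is always discarded, so sentences are ranked by their moderately frequent words.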




## Categorize

Use machine learning to assign a genre to each movie. Use the scaled dataset to test different models.
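
Scaling matters for distance-based models such as k-nearest neighbours and RBF-kernel SVMs, because a feature with a large numeric range would otherwise dominate the distance. The notebook does not show how its dataset was scaled; the sketch below assumes plain standardization (z-scores), and the `standardize` helper and the `raw` feature values are hypothetical.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / float(len(c)) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / float(len(c)))
            for c, m in zip(cols, means)]
    # Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

# Hypothetical features: runtime in minutes and a 0-10 rating.
raw = [[90.0, 5.0], [120.0, 7.0], [150.0, 9.0]]
scaled = standardize(raw)
```

After scaling, both columns span the same range, so runtime no longer outweighs rating when distances are computed.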

In [283]:
from sklearn.ensemble  import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.ensemble  import AdaBoostClassifier as ABC
from sklearn.svm       import SVC

def randFor(train,target,test,n,d,f):
    clf = RFC(n_estimators=n,max_depth=d,max_features=f)
    clf.fit(train,target)
    return clf.predict(test)

def kNearN(train,target,test,k,w):
    clf = KNC(n_neighbors=k,weights=w)
    clf.fit(train,target)
    return clf.predict(test)

def suppVC(train,target,test,k,g,c):
    clf = SVC(kernel=k,gamma=g,C=c)
    clf.fit(train,target)
    return clf.predict(test)

def aBoost(train,target,test,b,n,l,a):
    clf = ABC(base_estimator=b,n_estimators=n,learning_rate=l,algorithm=a)
    clf.fit(train,target)
    return clf.predict(test)

In [286]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor as KNR
from sklearn.ensemble  import RandomForestRegressor
from sklearn.svm       import SVR

# dfKK, dfNum, targets and mse_scorer are assumed to be defined in earlier cells
clfKN = KNR(n_neighbors=13,weights='uniform')
clfSV = SVR(kernel='rbf',C=5.0,gamma=0.21)
clfRF = RandomForestRegressor(n_estimators=200,max_depth=5,max_features='auto')

scoresKN = cross_val_score(clfKN,dfKK,targets,cv=5,scoring=mse_scorer)
scoresSV = cross_val_score(clfSV,dfNum,targets,cv=5,scoring=mse_scorer)
scoresRF = cross_val_score(clfRF,dfNum,targets,cv=5,scoring=mse_scorer)
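
What `cv=5` does can be sketched in plain Python: the data is cut into five folds, and each fold is held out once while the model is fitted on the other four. The code below is an illustration only, using a one-feature nearest-neighbour regressor and made-up data; the helper names (`k_fold_indices`, `knn_predict`, `cross_val_mse`) are assumptions, and it returns plain mean squared errors rather than whatever `mse_scorer` produces.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)  # first folds absorb the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / float(len(nearest))

def cross_val_mse(xs, ys, n_folds=5, k=3):
    """Return one mean-squared-error value per held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), n_folds):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        errors = [(knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2
                  for i in test_idx]
        scores.append(sum(errors) / float(len(errors)))
    return scores

# Made-up data: the target is exactly twice the feature.
xs = [float(i) for i in range(20)]
ys = [2.0 * x for x in xs]
fold_scores = cross_val_mse(xs, ys, n_folds=5, k=3)
```

One score per fold comes back, just as `cross_val_score` returns an array of five values. If `mse_scorer` was built with scikit-learn's `make_scorer(..., greater_is_better=False)`, which is an assumption, its values would be negated errors, so values closer to zero are better.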