# Modelos Analíticos para Ciencia de Datos Parte III. ML Supervisado

## Objetivos

1. Ver un ejemplo de sistemas de recomendación simple.
2. Ver un ejemplo de modelo uso de algoritmos de agrupamiento K-Means.

### Sistema de Recomendación Simple

In [62]:
# Importar la libreria de Pandas:
import pandas as pd

# Cargar archivo de datos:
metadata = pd.read_csv(
    "D://Coderhouse//9. Modelos Analíticos para Ciencia de Datos//movies_metadata.csv",
    low_memory=False,
)

# Inspeccionar las tres primeras columnas:
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


### Ecuación de evaluación

$$WeightedRating(WR) = (\frac{v}{v+m}*R) +(\frac{m}{v+m}*C)$$

los términos son:

v = numero de votos por la pelicula;

m = numero mínimo de votos para estar incluido en el listado;

R = el rating promedio de la película;

C = la media del rating votado a lo largo del conjunto de datos.

In [63]:
# Calcular la media del promedio de votos:
C = metadata["vote_average"].mean()
print(C)

5.618207215134185


In [64]:
# Calcular el minimo de votos, m:
m = metadata["vote_count"].quantile(0.90)
print(m)

160.0


In [65]:
# Crear un nuevo Dataframe con los datos:
q_movies = metadata.copy().loc[metadata["vote_count"] >= m]
q_movies.shape

(4555, 24)

In [66]:
# Inspección del tamaño original:
metadata.shape

(45466, 24)

In [67]:
# Definimos la función para crear el WR:
def weighted_rating(x, m=m, C=C):
    v = x["vote_count"]
    R = x["vote_average"]
    # Calculos basados en la fórmula de IMDB
    return (v / (v + m) * R) + (m / (m + v) * C)

In [68]:
# Definimos un nuevo feature 'score' y calculamos su valor con la función `weighted_rating()`:
q_movies["score"] = q_movies.apply(weighted_rating, axis=1)

In [69]:
# Ordenamos los datos según el 'score' obtenido:
q_movies = q_movies.sort_values("score", ascending=False)

# Imprimimos los primeros 20 registros:
q_movies[["title", "vote_count", "vote_average", "score"]].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


### Agrupacion con K-Means

In [70]:
# Importamos las librerias necesarias:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [71]:
# Cargamos los conjuntos de entrenamiento y prueba de nuestro set de datos:

train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

In [72]:
print("***** Train_Set *****")
print(train.head())
print("\n")
print("***** Test_Set *****")
print(test.head())

***** Train_Set *****
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.05

In [73]:
# Analizamos los datos:
print("***** Train_Set *****")
print(train.describe())
print("\n")
print("***** Test_Set *****")
print(test.describe())

***** Train_Set *****
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


***** Test_Set *****
       PassengerId      Pclass      

In [74]:
# Analizamos valores faltantes:
print("*****In the train set*****")
print(train.isna().sum())
print("\n")
print("*****In the test set*****")
print(test.isna().sum())

*****In the train set*****
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


*****In the test set*****
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [75]:
# Imputamos los datos faltantes en el conjunto de entrenamiento:
train["Age"].fillna(train["Age"].mean(), inplace=True)

# Imputamos los datos faltantes en el conjunto de prueba:
test["Age"].fillna(test["Age"].mean(), inplace=True)

In [76]:
print(test.isna().sum())

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [77]:
train["Cabin"].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [78]:
train = train.drop(["Name", "Ticket", "Cabin", "Embarked"], axis=1)
test = test.drop(["Name", "Ticket", "Cabin", "Embarked"], axis=1)

In [79]:
labelEncoder = LabelEncoder()
labelEncoder.fit(train["Sex"])
labelEncoder.fit(test["Sex"])
train["Sex"] = labelEncoder.transform(train["Sex"])
test["Sex"] = labelEncoder.transform(test["Sex"])

In [80]:
X = np.array(train.drop(["Survived"], axis=1).astype(float))
y = np.array(train["Survived"])

In [81]:
kmeans = KMeans(n_clusters=2)  # Buscamos grupos (1: supervicencia)
kmeans.fit(X)



In [82]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(
    correct / len(X)
)  # Imprimimos la cantidad de predicciones correctas divididas por el total de registros

0.49158249158249157


In [83]:
kmeans = kmeans = KMeans(n_clusters=2, max_iter=600, algorithm="auto")
kmeans.fit(X)



In [84]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct / len(X))

0.5084175084175084


In [85]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [86]:
kmeans.fit(X_scaled)



In [87]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct / len(X))

0.37373737373737376
