# Predicting the review score

A) How hard is it to predict the review score?

B) What is the simplest model that can be built with acceptable results?

C) What is the smallest amount of data that can be used to obtain acceptable results?

# Preprocessing

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
from matplotlib import pyplot as plt

In [2]:
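import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# First run only: the stop-word list and the WordNet data must be downloaded
# nltk.download("stopwords")
# nltk.download("wordnet")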

## Reading the datasets

In [3]:
df_reviews = pd.read_csv("dataset/silkroad_reviews.csv")
df_items = pd.read_csv("dataset/silkroad_items.csv")
df_shippings = pd.read_csv("dataset/silkroad_shippings.csv")

# Exploration

### Target analysis

The target is the rating column of the silkroad_reviews.csv dataset.

In [4]:
df_reviews.head()

Unnamed: 0,item_id,date_created,comment,rating
0,250-grams-ghb-powder,2013-11-28,"fast shipping, great product",5.0
1,10g-washed-fishscale-cocaine,2013-11-28,Amazing product! Perfect stealth. Will def...,5.0
2,roche-valium-10mg,2013-11-28,"good vendor, posted express when paid for regu...",5.0
3,10g-washed-fishscale-cocaine,2013-11-28,great stuff,5.0
4,influence-the-psychology-of-persuasion,2013-11-28,Item received as advertised. Quick response. G...,5.0


In [5]:
df_reviews["rating"].value_counts()

5.0       297369
0.0        16837
50.0           1
1335.0         1
500.0          1
Name: rating, dtype: int64

Almost all rating values are 0 or 5. Values other than 0 and 5 are therefore treated as outliers.

Since only two classes remain, the task is treated as a binary classification problem and the target is encoded accordingly.
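
Because the two classes are heavily imbalanced, it helps to know how well a trivial classifier would do before judging any model. The sketch below hard-codes the counts printed above (leaving out the three outlier rows) and computes the accuracy of always predicting the majority class. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above; the three outlier rows are left out
rating_counts = pd.Series({5.0: 297369, 0.0: 16837})

# Share of each class among the valid reviews
class_share = rating_counts / rating_counts.sum()

# Accuracy obtained by always predicting the most frequent rating
majority_baseline = class_share.max()

print(class_share.round(4))
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any model has to beat this baseline, which is close to 95%, so accuracy alone is probably a weak metric for question A; a metric that looks at the minority class may be more informative.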

In [6]:
df_reviews["rating"].astype(int)

0         5
1         5
2         5
3         5
4         5
         ..
314204    0
314205    5
314206    5
314207    5
314208    5
Name: rating, Length: 314209, dtype: int64

In [7]:
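def encode_target(df, column_name="rating"):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Keep only the two valid ratings (0 and 5); anything else becomes NaN
    df[column_name] = df[column_name].where(df[column_name].isin([0.0, 5.0]))
    # Binary encoding: 5 -> 1, 0 stays 0
    df[column_name] = df[column_name].replace(5.0, 1.0)
    # Drop the outlier rows
    return df.dropna(subset=[column_name])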

## Feature selection

A general dataframe is created to hold the features that will be used for the prediction.

In [8]:
df_general = df_reviews.copy()
df_general = encode_target(df_general)
df_general.head()

Unnamed: 0,item_id,date_created,comment,rating
0,250-grams-ghb-powder,2013-11-28,"fast shipping, great product",1.0
1,10g-washed-fishscale-cocaine,2013-11-28,Amazing product! Perfect stealth. Will def...,1.0
2,roche-valium-10mg,2013-11-28,"good vendor, posted express when paid for regu...",1.0
3,10g-washed-fishscale-cocaine,2013-11-28,great stuff,1.0
4,influence-the-psychology-of-persuasion,2013-11-28,Item received as advertised. Quick response. G...,1.0


### Analysis of the silkroad_items.csv and silkroad_shippings.csv datasets

Merging these datasets directly with the one that holds the target produces an unmanageable number of duplicated columns, so features are instead aggregated per item_id and then merged into the general dataframe.

Aggregations by feature type:
- numeric: mean, maximum, minimum, standard deviation
- date: maximum consecutive days, count per day, per year, per month
- categorical: concatenation
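
The three kinds of aggregation listed above can be sketched on a toy table as follows. The data, the output column names and the helper `longest_daily_streak` are hypothetical and only illustrate the idea. Unlike the notebook, which joins every value, the sketch deduplicates categorical values before joining; that is an assumption about what is useful, not the original behaviour.

```python
import pandas as pd

# Toy stand-in for an item-level table; all values are made up for illustration
toy = pd.DataFrame({
    "item_id": ["a", "a", "a", "b"],
    "price": [1.0, 2.0, 3.0, 5.0],
    "category": ["Books", "Books", "Books", "Alcohol"],
    "timestamp": pd.to_datetime(
        ["2014-01-01", "2014-01-02", "2014-01-04", "2014-02-10"]
    ),
})

def longest_daily_streak(timestamps):
    """Length of the longest run of consecutive calendar days."""
    days = pd.Series(timestamps.dt.normalize().unique()).sort_values()
    # A new run starts whenever the gap to the previous day is not exactly one day
    run_id = (days.diff() != pd.Timedelta(days=1)).cumsum()
    return int(run_id.value_counts().max())

features = toy.groupby("item_id").agg(
    # Numeric feature: summary statistics
    price_mean=("price", "mean"),
    price_max=("price", "max"),
    price_min=("price", "min"),
    price_std=("price", "std"),
    # Categorical feature: join the distinct values into one string
    category=("category", lambda s: " ".join(sorted(set(s)))),
    # Date feature: longest run of consecutive days
    max_consecutive_days=("timestamp", longest_daily_streak),
).reset_index()

print(features)
```

Items with a single row get a NaN standard deviation, which matches the NaN values visible in the aggregated outputs of this notebook.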

In [27]:
df_shippings.head()

Unnamed: 0,item_id,description,est_delivery,price,timestamp
0,jj-luna-privacy-ebooks,ebook link,1 day,0.0,2014-02-23T05:39:22
1,the-morality-of-capitalism,letter,4 days,0.001722,2014-02-23T05:39:29
2,survive,small book (domestic),4 days,0.017217,2014-02-23T05:39:33
3,a-non-religious-new-testament,media mail parcel (domestic),4 days,0.00687,2014-02-23T05:39:37
4,mindless-slogans-101-cheap-substitutes-for-act...,ebook link,1 day,0.0,2014-02-23T05:39:41


In [13]:
def features_items(df_items):

    df_items_to_merge = pd.DataFrame()
    
    df_items_group_by_price = df_items.groupby("item_id")["price"]
    df_items_to_merge["CF|item|price_mean"] = df_items_group_by_price.mean()
    df_items_to_merge["CF|item|price_quantile50"] = df_items_group_by_price.quantile(0.5)
    df_items_to_merge["CF|item|price_quantile75"] = df_items_group_by_price.quantile(0.75)
    df_items_to_merge["CF|item|price_quantile25"] = df_items_group_by_price.quantile(0.25)
    df_items_to_merge["CF|item|price_count"] = df_items_group_by_price.count()
    df_items_to_merge["CF|item|price_max"] = df_items_group_by_price.max()
    df_items_to_merge["CF|item|price_min"] = df_items_group_by_price.min()
    df_items_to_merge["CF|item|price_std"] = df_items_group_by_price.std()

    df_items_to_merge["CF|item|original_path"] = df_items.groupby("item_id")["original_path"].apply(lambda x: ' '.join(x))

    df_items_to_merge["CF|item|vendor"] = df_items.groupby("item_id")["vendor"].apply(lambda x: ' '.join(x))

    df_items_to_merge["CF|item|title"] = df_items.groupby("item_id")["title"].apply(lambda x: ' '.join(x))

    df_items_to_merge["CF|item|category"] = df_items.groupby("item_id")["category"].apply(lambda x: ' '.join(x))
    
    return df_items_to_merge.reset_index()

In [14]:
df_items_to_merge = features_items(df_items)
df_items_to_merge.head()

Unnamed: 0,item_id,CF|item|price_mean,CF|item|price_max,CF|item|price_min,CF|item|price_std,CF|item|original_path,CF|item|vendor,CF|item|title,CF|item|category
0,0-001g-10g-lcd-digital-jewelry-diamond-pocket-...,0.084408,0.092069,0.076747,0.010834,2014-03-03/items/0-001g-10g-lcd-digital-jewelr...,sweshroom sweshroom,0.001g-10g LCD Digital Jewelry Diamond Pocket ...,Drug paraphernalia Drug paraphernalia
1,0-001g-20g-precision-measure-digital-milligram...,0.051429,0.051429,0.051429,,2014-02-11/items/0-001g-20g-precision-measure-...,sweshroom,0.001g 20g precision measure Digital Milligram...,Drug paraphernalia
2,0-01g-precision-scale-pocketsize-small-and-com...,0.105,0.105,0.105,,2014-10-15/items/0-01g-precision-scale-pockets...,sn4pp,0.01g Precision scale - Pocketsize (Small and ...,Alcohol
3,0-1-gr-ketamine-sample-only-one-sample-person,0.0,0.0,0.0,,2014-09-26/items/0-1-gr-ketamine-sample-only-o...,SRvendor,0.1 gr Ketamine Sample - ONLY one sample / PERSON,Dissociatives
4,0-1-gram-quality-ice-shard-express-post-shipping,0.213443,0.26104,0.174321,0.02663,2014-08-27/items/0-1-gram-quality-ice-shard-ex...,Twokeen Twokeen Twokeen Twokeen Twokeen Twokee...,0.1 gram Quality Ice Shard Express Post Shippi...,Stimulants Stimulants Stimulants Stimulants St...


In [15]:
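def encode_est_delivery(x):
    # "contact vendor" carries no numeric estimate, so it is flagged with -1
    if (x == "contact vendor"):
        return -1
    # Other values look like "4 days": keep the leading number of days
    return int(x.split(" ")[0])
    
df_shippings["CF|est_delivery"] = df_shippings["est_delivery"].astype(str).apply(encode_est_delivery)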
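
The encoding above assumes that every value is either "contact vendor" or starts with an integer; a missing value would become the string "nan" and make `int()` fail. A more defensive variant could look like the sketch below. The function name `encode_est_delivery_safe` and the choice of returning NaN for unparseable values are assumptions, not part of the original notebook.

```python
import re

import numpy as np
import pandas as pd

def encode_est_delivery_safe(value):
    """Leading number of days, -1 for 'contact vendor', NaN when nothing can be parsed."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip().lower()
    if text == "contact vendor":
        return -1.0
    # Take the digits at the start of the string, e.g. "4 days" -> 4
    match = re.match(r"(\d+)", text)
    return float(match.group(1)) if match else np.nan

sample = pd.Series(["1 day", "4 days", "contact vendor", None, "asap"])
print(sample.apply(encode_est_delivery_safe).tolist())
```

Note that the -1 sentinel takes part in the per-item mean, minimum and standard deviation computed later, so it may be worth encoding "contact vendor" as NaN plus a separate flag instead.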

In [16]:
def features_shippings(df_shippings):
    # One row per item_id, aggregating all of its shipping options

    df_shippings_to_merge = pd.DataFrame()

    df_shippings_to_merge["CF|shippings|price_mean"] = df_shippings.groupby("item_id")["price"].mean()
    df_shippings_to_merge["CF|shippings|price_max"] = df_shippings.groupby("item_id")["price"].max()
    df_shippings_to_merge["CF|shippings|price_min"] = df_shippings.groupby("item_id")["price"].min()
    df_shippings_to_merge["CF|shippings|price_std"] = df_shippings.groupby("item_id")["price"].std()

    df_shippings_to_merge["CF|shippings|est_delivery_mean"] = df_shippings.groupby("item_id")["CF|est_delivery"].mean()
    df_shippings_to_merge["CF|shippings|est_delivery_max"] = df_shippings.groupby("item_id")["CF|est_delivery"].max()
    df_shippings_to_merge["CF|shippings|est_delivery_min"] = df_shippings.groupby("item_id")["CF|est_delivery"].min()
    df_shippings_to_merge["CF|shippings|est_delivery_std"] = df_shippings.groupby("item_id")["CF|est_delivery"].std()

    # Cast to str so that missing descriptions do not break the join
    df_shippings["description"] = df_shippings["description"].astype(str)
    df_shippings_to_merge["CF|shippings|description"] = df_shippings.groupby("item_id")["description"].apply(lambda x: ' '.join(x))
    
    return df_shippings_to_merge.reset_index()

In [17]:
df_shippings_to_merge = features_shippings(df_shippings)
df_shippings_to_merge.head()

Unnamed: 0,item_id,CF|shippings|price_mean,CF|shippings|price_max,CF|shippings|price_min,CF|shippings|price_std,CF|shippings|est_delivery_mean,CF|shippings|est_delivery_max,CF|shippings|est_delivery_min,CF|shippings|est_delivery_std,CF|shippings|description
0,0-001g-10g-lcd-digital-jewelry-diamond-pocket-...,0.009397,0.017267,0.0,0.005732,6.285714,14,2,4.498677,Europe Worldwide Domestic (Sweden) Add Items t...
1,0-001g-20g-precision-measure-digital-milligram...,0.006327,0.01218,0.0,0.005139,7.25,14,2,5.377422,Add Items to a order you are about to place wi...
2,0-01g-precision-scale-pocketsize-small-and-com...,0.0,0.0,0.0,,1.0,1,1,,Free shipping
3,0-1-gr-ketamine-sample-only-one-sample-person,0.025629,0.025629,0.025629,0.0,7.0,7,7,0.0,FREE 0.1 gr Cocaine Sample - ONLY one sample /...
4,0-1-gram-quality-ice-shard-express-post-shipping,0.024029,0.029741,0.018636,0.003493,2.0,2,2,0.0,Express Post Shipping Express Post Shipping Ex...


In [18]:
df_general = df_general.merge(df_items_to_merge, how="left", on="item_id").merge(df_shippings_to_merge, how="left", on="item_id")
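
Since both feature tables are meant to hold one row per item_id, the left joins should neither add nor drop reviews. A small sanity check, shown here on toy frames standing in for `df_general` and `df_items_to_merge`, might use the `validate` and `indicator` parameters of `DataFrame.merge`. The toy data and variable names are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for df_general and df_items_to_merge
reviews = pd.DataFrame({"item_id": ["a", "a", "b", "c"], "rating": [1.0, 0.0, 1.0, 1.0]})
item_features = pd.DataFrame({"item_id": ["a", "b"], "price_mean": [2.0, 5.0]})

merged = reviews.merge(
    item_features,
    how="left",
    on="item_id",
    validate="many_to_one",  # raises MergeError if item_id repeats on the right side
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# A left join against unique keys must keep exactly one row per review
assert len(merged) == len(reviews)

# Share of reviews whose item has no aggregated features
unmatched_share = (merged["_merge"] == "left_only").mean()
print(f"Reviews without item features: {unmatched_share:.0%}")
```

Reviews flagged as "left_only" end up with NaN in every merged feature, so their share indicates how much imputation a model would need.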

In [19]:
df_general.head()

Unnamed: 0,item_id,date_created,comment,rating,CF|item|price_mean,CF|item|price_max,CF|item|price_min,CF|item|price_std,CF|item|original_path,CF|item|vendor,...,CF|item|category,CF|shippings|price_mean,CF|shippings|price_max,CF|shippings|price_min,CF|shippings|price_std,CF|shippings|est_delivery_mean,CF|shippings|est_delivery_max,CF|shippings|est_delivery_min,CF|shippings|est_delivery_std,CF|shippings|description
0,250-grams-ghb-powder,2013-11-28,"fast shipping, great product",1.0,0.762496,1.214565,0.436762,0.182588,2013-12-20/items/250-grams-ghb-powder 2014-01-...,HoneyBee HoneyBee HoneyBee HoneyBee HoneyBee H...,...,GHB GHB GHB GHB GHB GHB GHB GHB GHB GHB GHB Dr...,0.0142,0.022239,0.008277,0.003453,4.0,4,4,0.0,USPS USPS USPS USPS USPS USPS USPS USPS USPS U...
1,10g-washed-fishscale-cocaine,2013-11-28,Amazing product! Perfect stealth. Will def...,1.0,0.019282,0.02908,0.012058,0.004771,2014-02-24/items/10g-washed-fishscale-cocaine_...,bcpltd bcpltd bcpltd bcpltd bcpltd bcpltd bcpl...,...,Stimulants Stimulants Stimulants Stimulants St...,0.0,0.0,0.0,,5.0,5,5,,Free Shipping
2,roche-valium-10mg,2013-11-28,"good vendor, posted express when paid for regu...",1.0,0.008892,0.011954,0.00667,0.001381,2014-02-24/items/roche-valium-10mg 2013-12-20/...,thetransporter thetransporter thetransporter t...,...,Prescription Prescription Prescription Prescri...,0.01485,0.034821,0.002988,0.011156,4.942857,7,3,2.02837,Regular Express Express Regular Express Reg...
3,10g-washed-fishscale-cocaine,2013-11-28,great stuff,1.0,0.019282,0.02908,0.012058,0.004771,2014-02-24/items/10g-washed-fishscale-cocaine_...,bcpltd bcpltd bcpltd bcpltd bcpltd bcpltd bcpl...,...,Stimulants Stimulants Stimulants Stimulants St...,0.0,0.0,0.0,,5.0,5,5,,Free Shipping
4,influence-the-psychology-of-persuasion,2013-11-28,Item received as advertised. Quick response. G...,1.0,0.002099,0.003146,0.001343,0.000458,2014-02-24/items/influence-the-psychology-of-p...,MrTerrific MrTerrific MrTerrific MrTerrific Mr...,...,Alcohol Alcohol Alcohol Alcohol Alcohol Alcoho...,0.0,0.0,0.0,,1.0,1,1,,Digital Download


# Data exploration

In [12]:
def describe_categorical_features(df):
    categorical_features = df.columns[df.dtypes == object]
    for feature in categorical_features:
        print(feature + "---------------------------")
        print(df[feature].value_counts())
        print()

In [14]:
describe_categorical_features(df_reviews)

item_id---------------------------
5g-white-widow-dutchmagic                                                                                   1464
ketamine-1g-c64897dc-9ab4-43e6-8d93-a9e38dccc8e4                                                            1189
liquid-mushrooms-pure-psilocybin-no-nausea-faster-trip-cleaner-feel-than-dried-shrooms-click-for-details    1118
1-0-gram-pure-columbian-flake-high-quality-cocaine-meerkovo                                                 1022
1g-platinum-standard-pure-fire-mdma                                                                          993
                                                                                                            ... 
3-lsd-blotters-110-g-apningstilbud                                                                             1
10g-northern-lights-a-coffeshop-quality-weed                                                                   1
10-x-mdma-capsules-125mg-90-2-new-vendor-special-200-with-fre

In [16]:
lm = WordNetLemmatizer()
# Build the stop-word set once; rebuilding it for every word makes the loop extremely slow
stop_words = set(stopwords.words('english'))

def text_transformation(df_col):
    corpus = []
    for item in df_col:
        # Keep letters only, lowercase, and split into tokens
        tokens = re.sub('[^a-zA-Z]', ' ', str(item)).lower().split()
        # Lemmatize every token that is not a stop word
        tokens = [lm.lemmatize(word) for word in tokens if word not in stop_words]
        corpus.append(' '.join(tokens))
    return corpus
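
To see what this cleaning does to a single comment without downloading any NLTK data, the sketch below applies the same regular expression and lowercasing with a tiny hand-written stop-word list. The list and the function name `clean_comment` are illustrative assumptions, and lemmatization is left out.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "the", "and", "will", "as", "for", "when"}

def clean_comment(comment):
    """Keep letters only, lowercase, and drop stop words (no lemmatization here)."""
    tokens = re.sub("[^a-zA-Z]", " ", str(comment)).lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(clean_comment("Amazing product! Perfect stealth."))
# prints "amazing product perfect stealth"
```

Because the stop-word set is built once at module level, each membership test is a constant-time lookup, which is what makes it feasible to process a few hundred thousand comments.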

In [17]:
corpus = text_transformation(df_reviews['comment'])

In [None]:
df_reviews["NF|npl"] = corpus

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_encoding(df, feature_name, prefix, min_df=0.01, df_test= None, isTest = False):
    # Fit the vocabulary and IDF weights on the training text only
    train = df[feature_name].fillna("")
    tfidfvectorizer = TfidfVectorizer(analyzer='word',stop_words= 'english', min_df=min_df)
    tfidf_wm = tfidfvectorizer.fit_transform(train)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    tfidf_tokens = tfidfvectorizer.get_feature_names_out()
    if(isTest):
        # Encode the test text with the vocabulary learned on the training text
        test = df_test[feature_name].fillna("")
        tfidf_wm = tfidfvectorizer.transform(test)
    df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(), columns = tfidf_tokens)
    df_tfidfvect.columns = ["NF|TFIDF|" + prefix + "|" + x for x in df_tfidfvect.columns]
    return df_tfidfvect

In [None]:
df_review_tfidf = tfidf_encoding(df_reviews, "NF|npl", "comment", min_df=0.01, df_test= None, isTest = False)

In [None]:
df_review_tfidf

In [None]:
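# FEATURE: est_delivery

def encode_est_delivery(x):
    # "contact vendor" carries no numeric estimate, so it is flagged with -1
    if (x == "contact vendor"):
        return -1
    # Other values look like "4 days": keep the leading number of days
    return int(x.split(" ")[0])
    

# est_delivery is a column of df_shippings; df_reviews does not contain it
df_shippings["CUSTOM_est_delivery"] = df_shippings["est_delivery"].astype(str).apply(encode_est_delivery)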