# Challenge 6

In this challenge we practice _feature engineering_, one of the most important and labor-intensive steps in ML. We use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which covers 227 countries and includes population size, area, migration, and production-sector data.

> Note: Please do not rename the answer functions.

## General _setup_

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk

In [3]:
# Some matplotlib settings.
# %matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [4]:
countries = pd.read_csv("countries of the world.csv")

In [5]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


In [114]:
countries.info()

## Notes

This _data set_ still needs a few initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra spaces at the beginning and end of each string. You can use the `str.strip()` method to remove them.

## Start your analysis here

In [6]:
num_cols = [
    "Population", "Area", "Pop_density", "Coastline_ratio", "Net_migration",
    "Infant_mortality", "GDP", "Literacy", "Phones_per_1000", "Arable", "Crops",
    "Other", "Climate", "Birthrate", "Deathrate", "Agriculture", "Industry", "Service"
]

# Replace the decimal comma with a point and cast to float (np.float was removed in NumPy 1.24).
for col in num_cols:
    countries[col] = countries[col].astype(str).str.replace(",", ".", regex=False).astype(float)

In [7]:
countries["Country"] = countries["Country"].str.strip()
countries["Region"] = countries["Region"].str.strip()
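
As an alternative to converting after loading, the decimal comma could probably be handled at read time. The sketch below assumes a small in-memory sample (the two rows and their padding are made up for illustration, not taken from the real file) and relies on the `decimal` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same style as the raw file:
# quoted numbers with a decimal comma and padded text fields.
raw = io.StringIO(
    'Country,Region,Pop_density\n'
    '"Afghanistan ","ASIA (EX. NEAR EAST)         ","48,0"\n'
    '"Albania ","EASTERN EUROPE      ","124,6"\n'
)

# decimal="," tells the parser to read "48,0" as the float 48.0.
sample = pd.read_csv(raw, decimal=",")

# Strip the padding from the text columns in one pass.
text_cols = ["Country", "Region"]
sample[text_cols] = sample[text_cols].apply(lambda s: s.str.strip())

print(sample.dtypes)
print(sample)
```

Whether this is enough for the full file depends on how every column is formatted, so the explicit loop above remains the safer reference.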

In [8]:
countries.shape

(227, 20)

In [51]:
countries.isna().sum()

Country              0
Region               0
Population           0
Area                 0
Pop_density          0
Coastline_ratio      0
Net_migration        0
Infant_mortality     3
GDP                  1
Literacy            18
Phones_per_1000      4
Arable               2
Crops                2
Other                2
Climate             22
Birthrate            3
Deathrate            4
Agriculture         15
Industry            16
Service             15
dtype: int64

## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions, with leading and trailing spaces removed (but keeping punctuation such as periods and hyphens), sorted alphabetically.

In [6]:
def q1():
    # Unique regions, sorted alphabetically.
    return sorted(countries["Region"].unique())

## Question 2

After discretizing the `Pop_density` variable into 10 intervals with `KBinsDiscretizer`, using the `ordinal` encoding and the `quantile` strategy, how many countries lie above the 90th percentile? Answer with a single integer scalar.

In [93]:
from sklearn.preprocessing import (
    OneHotEncoder, Binarizer, KBinsDiscretizer,
    MinMaxScaler, StandardScaler, PolynomialFeatures
)

In [94]:
discretizer = KBinsDiscretizer(n_bins=10,encode='ordinal',strategy='quantile')
discretizer.fit(countries[["Pop_density"]])

KBinsDiscretizer(encode='ordinal', n_bins=10, strategy='quantile')

In [95]:
discretizer.bin_edges_[0]

array([0.00000e+00, 1.01400e+01, 2.12200e+01, 3.94800e+01, 5.98000e+01,
       7.88000e+01, 1.05540e+02, 1.50120e+02, 2.53700e+02, 3.96740e+02,
       1.62715e+04])

In [102]:
def q2():
    # Bins are right-open, so a value equal to the last inner edge (the 90th percentile) falls in the top bin.
    return int((countries["Pop_density"] >= discretizer.bin_edges_[0][9]).sum())
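
The count can also be cross-checked from the discretizer output itself: rows assigned to the last ordinal bin are the ones at or above the 90th percentile. The sketch below uses a made-up array of 100 distinct values as a stand-in for `Pop_density`; `toy_discretizer` and the other names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical stand-in for Pop_density: 100 distinct values.
values = np.arange(100, dtype=float).reshape(-1, 1)

toy_discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bins = toy_discretizer.fit_transform(values)

# Rows in the last ordinal bin (index 9) lie at or above the 90th percentile.
top_bin_count = int((bins[:, 0] == 9).sum())

# The same count obtained from the fitted bin edges.
edge_count = int((values[:, 0] >= toy_discretizer.bin_edges_[0][9]).sum())

print(top_bin_count, edge_count)
```

Both counts should agree as long as no bins are merged, which can happen when the data has many repeated values.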

## Question 3

If we encode the `Region` and `Climate` variables with _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

In [101]:
def q3():
    # Series.nunique() ignores NaN by default, so missing Climate values are not counted as a category.
    region_attrs = countries["Region"].nunique()
    climate_attrs = countries["Climate"].nunique()
    return int(region_attrs + climate_attrs)
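
To see how missing values affect the column count, here is a small sketch on a made-up frame. It uses `pd.get_dummies` rather than scikit-learn's `OneHotEncoder`, purely for brevity; the toy categories are assumptions and not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 regions, 2 climates and one missing climate.
toy = pd.DataFrame({
    "Region": ["A", "B", "A", "C"],
    "Climate": [1.0, 2.0, np.nan, 1.0],
})

# By default NaN gets no indicator column of its own.
encoded = pd.get_dummies(toy, columns=["Region", "Climate"])
n_without_nan = encoded.shape[1]

# Treating NaN as a category of its own adds one more column for Climate.
n_with_nan = toy["Region"].nunique() + toy["Climate"].nunique(dropna=False)

print(list(encoded.columns))
print(n_without_nan, n_with_nan)
```

Which of the two counts is the intended answer depends on whether the missing `Climate` values are imputed, dropped, or kept as a separate category before encoding.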

## Question 4

Apply the following _pipeline_:

1. Fill the missing values of the `int64` and `float64` variables with their respective medians.
2. Standardize these variables.

After fitting the _pipeline_ described above on the data (only on the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the data below. What is the value of the `Arable` variable after the _pipeline_? Answer with a single float rounded to three decimal places.

In [127]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [139]:
df_pad=countries[["Country","Region"]]

In [128]:
num_pipeline = Pipeline(steps=[
    ("imputer",SimpleImputer(strategy='median')),
    ("standardizer", StandardScaler())
])

In [129]:
countries_pipetransf= num_pipeline.fit_transform(countries[num_cols])

countries_pipetransf[:10]

array([[ 1.96946842e-02,  2.75833238e-02, -1.99844336e-01,
        -2.93443424e-01,  4.75079803e+00,  3.63809820e+00,
        -8.96394232e-01, -2.49781686e+00, -1.02749132e+00,
        -1.26360817e-01, -5.18861111e-01,  3.72601687e-01,
        -1.69435818e+00,  2.21296666e+00,  2.25250740e+00,
         1.63657562e+00, -3.15405761e-01, -1.16113540e+00],
       [-2.13876876e-01, -3.18797484e-01, -1.53602957e-01,
        -2.75974351e-01, -1.02509671e+00, -3.92849919e-01,
        -5.16717983e-01,  1.51932689e-01, -7.26078820e-01,
         5.65115164e-01, -1.36038942e-02, -4.47933408e-01,
         1.31636046e+00, -6.27986167e-01, -8.09332600e-01,
         5.95163494e-01, -7.06318151e-01,  8.34243202e-02],
       [ 3.56181070e-02,  9.98420512e-01, -2.20489965e-01,
        -2.92888851e-01, -8.82420407e-02, -1.22886032e-01,
        -3.66845780e-01, -7.13827065e-01, -6.95494316e-01,
        -8.13978114e-01, -5.15252131e-01,  9.26275825e-01,
        -1.69435818e+00, -4.44844401e-01, -9.32859214e

In [140]:
countries_pad=pd.DataFrame(countries_pipetransf,columns=num_cols)
df_pad=pd.concat([df_pad,countries_pad],axis=1)

In [141]:
df_pad

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),0.019695,0.027583,-0.199844,-0.293443,4.750798,3.638098,-0.896394,-2.497817,-1.027491,-0.126361,-0.518861,0.372602,-1.694358,2.212967,2.252507,1.636576,-0.315406,-1.161135
1,Albania,EASTERN EUROPE,-0.213877,-0.318797,-0.153603,-0.275974,-1.025097,-0.392850,-0.516718,0.151933,-0.726079,0.565115,-0.013604,-0.447933,1.316360,-0.627986,-0.809333,0.595163,-0.706318,0.083424
2,Algeria,NORTHERN AFRICA,0.035618,0.998421,-0.220490,-0.292889,-0.088242,-0.122886,-0.366846,-0.713827,-0.695494,-0.813978,-0.515252,0.926276,-1.694358,-0.444844,-0.932859,-0.326627,2.390911,-1.673969
3,American Samoa,OCEANIA,-0.243834,-0.334779,-0.053514,0.514709,-4.281389,-0.741696,-0.167016,0.702871,0.108568,-0.290741,1.259163,-0.416135,-0.188999,0.035113,-1.204213,-0.340700,-0.074844,0.033392
4,Andorra,WESTERN EUROPE,-0.243720,-0.334629,-0.137002,-0.293443,1.354184,-0.890347,0.932047,0.860282,1.162182,-0.891152,-0.545327,1.004214,1.316360,-1.205379,-0.600755,-0.340700,-0.074844,0.033392
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,West Bank,NEAR EAST,-0.223408,-0.331610,0.024662,-0.293443,0.607177,-0.446957,-0.886403,0.466754,-0.398071,0.241758,1.736751,-1.093887,1.316360,0.866018,-1.072586,-0.404029,-0.014704,0.402382
223,Western Sahara,NORTHERN AFRICA,-0.242004,-0.185982,-0.228217,-0.287620,-0.007763,-0.407658,-0.411807,0.466754,-0.260662,-1.060934,-0.545327,1.141386,-1.694358,-0.295985,-0.278776,-0.340700,-0.074844,-1.036054
224,Yemen,NEAR EAST,-0.061923,-0.039330,-0.204312,-0.288452,-0.007763,0.745669,-0.886403,-1.752739,-0.876785,-0.847935,-0.516455,0.954334,-1.694358,1.878259,-0.185624,-0.087384,1.428665,-1.079832
225,Zambia,SUB-SAHARAN AFRICA,-0.146545,0.086427,-0.219584,-0.293443,-0.007763,1.508573,-0.886403,-0.157642,-1.005329,-0.516088,-0.541718,0.699943,-0.188999,1.707748,2.169481,0.510725,0.060472,-0.479442


In [10]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

In [142]:
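def q4():
    # Run the fitted pipeline on the test row instead of returning its raw Arable value.
    test_df = pd.DataFrame([test_country], columns=new_column_names)
    return float(round(num_pipeline.transform(test_df[num_cols])[0, num_cols.index("Arable")], 3))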
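
The question also mentions `ColumnTransformer`. A minimal sketch of that variant follows, assuming a made-up training frame (`train`, `new_row` and the column names are hypothetical). It selects the `int64` and `float64` columns by dtype, fits the median imputer and the scaler on them, and then reuses the fitted statistics on an unseen row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text column and two numeric ones.
train = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, np.nan, 3.0],
    "n": pd.Series([10, 20, 30, 40], dtype="int64"),
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Only int64/float64 columns go through the pipeline; the rest is dropped.
transformer = ColumnTransformer(
    transformers=[("num", numeric, make_column_selector(dtype_include=["int64", "float64"]))],
    remainder="drop",
)
transformer.fit(train)

# A new observation is transformed with the statistics learned from `train`.
new_row = pd.DataFrame({"name": ["e"], "x": [4.0], "n": pd.Series([25], dtype="int64")})
scaled = transformer.transform(new_row)
print(scaled.round(3))
```

The key point is that `transform` is called on the new row without refitting, so the medians, means and standard deviations come from the training data only.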

## Question 5

Find the number of _outliers_ of the `Net_migration` variable in the lower group and in the upper group according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Should you remove the observations considered _outliers_ by this method from the analysis? Answer with a tuple of three elements `(outliers_abaixo, outliers_acima, removeria?)`, that is, (outliers below, outliers above, would you remove them?), typed as (int, int, bool).

In [9]:
quant1=countries.Net_migration.quantile(0.25)
quant3=countries.Net_migration.quantile(0.75)
IQR=quant3-quant1

In [14]:
outliers_abaixo=countries['Net_migration'][countries.Net_migration<quant1-1.5*IQR].count()
outliers_acima=countries['Net_migration'][countries.Net_migration>quant3+1.5*IQR].count()

In [16]:
outliers_abaixo

24

In [161]:
def q5():
    return (int(outliers_abaixo), int(outliers_acima), False)
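
The boxplot rule can be checked by hand on a small made-up series. The values below are hypothetical and chosen so that exactly one point falls on each side of the fences.

```python
import pandas as pd

# Hypothetical sample with one extreme value on each side.
sample = pd.Series([-50, 1, 2, 3, 4, 5, 6, 7, 8, 60], dtype=float)

q1, q3 = sample.quantile(0.25), sample.quantile(0.75)
iqr = q3 - q1

# Fences of the boxplot rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

below = int((sample < lower).sum())
above = int((sample > upper).sum())
print((lower, upper), below, above)
```

Here the quartiles are 2.25 and 6.75, so the fences are -4.5 and 13.5, and only -50 and 60 are flagged.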

## Question 6

For questions 6 and 7, use `fetch_20newsgroups`, one of the data set loaders that ships with `sklearn`.

Load the following categories into the `newsgroup` data set:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroup` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.

In [18]:
from sklearn.datasets import fetch_20newsgroups

In [19]:
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

In [20]:
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

In [21]:
cv=CountVectorizer()
newgroups_counts=cv.fit_transform(newsgroup.data)

In [22]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
freqs = pd.DataFrame(newgroups_counts.toarray(), columns=cv.get_feature_names_out())

In [23]:
def q6():
    return int(freqs['phone'].sum())
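
The same count can be read straight from the sparse matrix, which avoids building a dense `DataFrame`. The sketch below assumes a made-up three-document corpus; `docs` and `phone_total` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus.
docs = [
    "my phone rings",
    "the phone and the other phone",
    "no calls today",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index.
phone_col = vectorizer.vocabulary_["phone"]
phone_total = int(counts[:, phone_col].sum())
print(phone_total)
```

On a large corpus this column lookup tends to use far less memory than `toarray()`.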

## Question 7

Apply `TfidfVectorizer` to the `newsgroup` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

In [24]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(newsgroup.data)
newsgroups_tfidf_vectorized = tfidf_vectorizer.transform(newsgroup.data)

In [25]:
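# Take the column names from the TF-IDF vectorizer itself, not from the CountVectorizer.
newsgroups_tfidf = pd.DataFrame(newsgroups_tfidf_vectorized.toarray(), columns=tfidf_vectorizer.get_feature_names_out())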

In [208]:
def q7():
    return float(round(newsgroups_tfidf['phone'].sum(),3))
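
To make the TF-IDF weighting concrete, here is a sketch on a made-up corpus. With the default `smooth_idf=True`, scikit-learn computes the inverse document frequency as $\ln\frac{1 + n}{1 + \text{df}} + 1$, and each document vector is then L2-normalized. The corpus and variable names are assumptions for illustration.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus; "phone" appears in two documents.
docs = [
    "my phone rings",
    "the phone and the other phone",
    "no calls today",
]

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
weights = tfidf.fit_transform(docs)

phone_col = tfidf.vocabulary_["phone"]
phone_idf = float(tfidf.idf_[phone_col])

# Smoothed idf: ln((1 + n_docs) / (1 + doc_freq)) + 1
expected_idf = math.log((1 + 3) / (1 + 2)) + 1

# Sum of the normalized TF-IDF weights of "phone" over all documents.
phone_tfidf_sum = float(weights[:, phone_col].sum())
print(round(phone_idf, 3), round(expected_idf, 3), round(phone_tfidf_sum, 3))
```

Because of the per-document normalization, the summed weight is not simply the term count multiplied by the idf.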