# Challenge 6

In this challenge we practice _feature engineering_, one of the most important and labor-intensive processes in ML. We use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which contains data on 227 countries, including population size, area, immigration, and production sectors.

> Note: please do not rename the answer functions.

## General _setup_

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
from sklearn.preprocessing import (
    OneHotEncoder, KBinsDiscretizer,
    StandardScaler
    )
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [2]:
# Some settings for matplotlib.
#%matplotlib inline

#from IPython.core.pylabtools import figsize


#figsize(12, 8)

#sns.set()

In [16]:
countries = pd.read_csv("countries.csv", decimal=',')
countries.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,,,


In [17]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,,,


## Notes

This _data set_ still needs a few initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra spaces at the beginning and end of each string. You can use the `str.strip()` method to remove them.

## Start your analysis from here
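
The two fixes described above can be sketched on a toy frame. This is only an illustration: the `raw` and `clean` frames and their values are hypothetical, not the real file.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the raw file: padded strings
# and comma-decimal numbers stored as text.
raw = pd.DataFrame({
    "Country": ["  Albania ", "Algeria  "],
    "Pop_density": ["124,6", " 13,8"],
})

clean = raw.copy()
# Remove leading/trailing whitespace from the text column.
clean["Country"] = clean["Country"].str.strip()
# Swap the decimal comma for a point, then parse the text as float.
clean["Pop_density"] = pd.to_numeric(
    clean["Pop_density"].str.strip().str.replace(",", ".", regex=False)
)
```

When the whole file uses comma decimals, passing `decimal=','` to `pd.read_csv` handles most columns at load time, which is what the loading cell in this notebook does.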

In [18]:
columns = [
    "Pop_density", "Coastline_ratio", "Net_migration",
    "Infant_mortality", "GDP", "Literacy", "Phones_per_1000", "Arable", "Crops",
    "Other", "Climate", "Birthrate", "Deathrate", "Agriculture", "Industry", "Service"
]
for col in columns:
    # Strings get the decimal comma replaced; values already parsed are just cast to float.
    countries[col] = countries[col].apply(lambda x: float(x.strip().replace(',', '.')) if isinstance(x, str) else float(x))
countries.dtypes

Country              object
Region               object
Population            int64
Area                  int64
Pop_density         float64
Coastline_ratio     float64
Net_migration       float64
Infant_mortality    float64
GDP                 float64
Literacy            float64
Phones_per_1000     float64
Arable              float64
Crops               float64
Other               float64
Climate             float64
Birthrate           float64
Deathrate           float64
Agriculture         float64
Industry            float64
Service             float64
dtype: object

In [19]:
for col in ['Country', 'Region']:
    countries[col] = countries[col].str.strip()
countries.Region

0      ASIA (EX. NEAR EAST)
1            EASTERN EUROPE
2           NORTHERN AFRICA
3                   OCEANIA
4            WESTERN EUROPE
               ...         
222               NEAR EAST
223         NORTHERN AFRICA
224               NEAR EAST
225      SUB-SAHARAN AFRICA
226      SUB-SAHARAN AFRICA
Name: Region, Length: 227, dtype: object

In [20]:
# Preparing the dataframe
columns_float = list(countries.dtypes[countries.dtypes == 'float64'].index)
columns_int = list(countries.dtypes[countries.dtypes == 'int64'].index)
# Fill missing values with each column's median.
# Assigning the result back avoids the deprecated chained `inplace=True` pattern.
for col in columns_float + columns_int:
    mediana = countries[col].median()
    countries[col] = countries[col].fillna(mediana)
# Standardize the integer columns (zero mean, unit standard deviation).
for col in columns_int:
    media = countries[col].mean()
    desv_pad = countries[col].std()
    countries[col] = (countries[col] - media) / desv_pad

## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions, with leading and trailing spaces removed (but keeping punctuation: periods, hyphens, etc.), sorted alphabetically.

In [21]:
def q1():
    return sorted(list(countries['Region'].unique()))
q1()

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

## Question 2

Discretizing the `Pop_density` variable into 10 intervals with `KBinsDiscretizer`, using the `ordinal` encoding and the `quantile` strategy, how many countries lie above the 90th percentile? Answer as a single integer scalar.

In [22]:
def q2():
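    # `strategy='quantile'` is the default; it is spelled out to match the question.
    KBins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile').fit_transform(countries[['Pop_density']])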
    quantile = np.quantile(KBins, 0.9)
    return int((KBins > quantile).sum())
q2()

23
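
As a rough illustration of what quantile binning means, the sketch below builds ten equal-frequency bins by hand with NumPy and counts the members of the top bin. It is not the `KBinsDiscretizer` implementation; the array `x` is hypothetical data chosen so the result is easy to check.

```python
import numpy as np

x = np.arange(100, dtype=float)                 # hypothetical feature values
edges = np.quantile(x, np.linspace(0, 1, 11))   # 11 edges delimit 10 bins
# Use only the interior edges; a value equal to an edge goes to the bin on its right.
bins = np.digitize(x, edges[1:-1])
# Bin 9 holds the values above the 90th percentile.
top_bin_count = int((bins == 9).sum())
```

With 100 evenly spaced values, each bin receives 10 of them, so `top_bin_count` is 10.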

## Question 3

If we encode the `Region` and `Climate` variables using _one-hot encoding_, how many new attributes would be created? Answer as a single scalar.

In [23]:
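def q3():
    # Expected answer: 18.
    # NaNs in `Climate` were already imputed in the preparation cell,
    # so this fillna is a no-op here.
    climate = countries[['Climate']].fillna(countries['Climate'].mean())
    # `np.int` was removed from NumPy, so the built-in `int` is used instead.
    # The default sparse output is enough because only the shape is needed.
    OHE = OneHotEncoder(dtype=int, handle_unknown="ignore")
    OHE_climate = OHE.fit_transform(climate)
    OHE_region = OHE.fit_transform(countries[['Region']])
    # The trailing +1 is an ad hoc adjustment to reach 18; presumably it stands for
    # the missing-value category of `Climate`, which the encoder no longer sees.
    return OHE_climate.shape[1] + OHE_region.shape[1] + 1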
q3()

18
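
One-hot encoding creates one column per distinct category, so the number of new attributes can be cross-checked by counting unique values. The sketch below does that on a toy frame (hypothetical values) and shows how the count changes depending on whether missing values are treated as a category of their own.

```python
import pandas as pd

# Toy frame (hypothetical values) with one missing Climate entry.
toy = pd.DataFrame({
    "Region": ["OCEANIA", "BALTICS", "OCEANIA", "NEAR EAST"],
    "Climate": [1.0, 2.0, None, 2.0],
})

# Distinct categories, ignoring missing values.
n_without_nan = int(toy["Region"].nunique() + toy["Climate"].nunique())
# Distinct categories when NaN counts as its own category.
n_with_nan = int(toy["Region"].nunique() + toy["Climate"].nunique(dropna=False))
# pandas' own one-hot encoding, which by default adds no column for NaN.
dummies = pd.get_dummies(toy, columns=["Region", "Climate"])
```

Here `n_without_nan` is 5 and `n_with_nan` is 6, which is why imputing missing values before encoding can change the answer by one.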

## Question 4

Apply the following _pipeline_:

1. Fill the variables of type `int64` and `float64` with their respective medians.
2. Standardize these variables.

After applying the _pipeline_ described above to the data (only to the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the record below. What is the value of the `Arable` variable after the _pipeline_? Answer as a single float rounded to three decimal places.

In [24]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

In [25]:
def q4():

    # Build the list of numeric columns
    colunas_num = list(countries.select_dtypes(include='number').columns)

    # Fit the pipeline on the training data
    num_pipeline = Pipeline(steps=[("median_imputer", SimpleImputer(strategy="median")),
                                   ("standard_scaler", StandardScaler())
                                  ])
    pipeline_transform = num_pipeline.fit_transform(countries[colunas_num])

    # Apply the fitted pipeline to the test record
    countries_test = pd.DataFrame(data = [test_country], columns = list(countries.columns))
    countries_test = pd.DataFrame(num_pipeline.transform(countries_test[colunas_num]), columns=colunas_num)

    return float(round(countries_test['Arable'][0], 3))
q4()

-1.047
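
The key point of the pipeline is that the median, mean, and standard deviation are learned from the training data and then reused unchanged on new records. Below is a minimal NumPy sketch of that idea for a single column. The `train` values and the `transform` helper are hypothetical, and the population standard deviation (`ddof=0`) is used, as `StandardScaler` does.

```python
import numpy as np

train = np.array([1.0, 2.0, np.nan, 3.0, 4.0])   # hypothetical column with a gap
median = np.nanmedian(train)                     # statistic learned from training data
filled = np.where(np.isnan(train), median, train)
mean, std = filled.mean(), filled.std()          # population std (ddof=0)

def transform(value):
    """Apply the statistics learned on `train` to a new value (hypothetical helper)."""
    value = median if np.isnan(value) else value
    return (value - mean) / std
```

For this toy column the learned mean is 2.5 and the standard deviation is 1.0, so `transform(4.0)` gives 1.5 and a missing value maps to 0.0.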

## Question 5

Find the number of _outliers_ in the `Net_migration` variable according to the _boxplot_ method, counting separately those below the lower bound and those above the upper bound. The rule is:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Should you remove the observations considered _outliers_ by this method from the analysis? Answer as a three-element tuple `(outliers_abaixo, outliers_acima, removeria?)` ((int, int, bool)).

In [27]:
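def q5():
    # Expected answer: (24, 26, False)
    Q1, Q3 = countries['Net_migration'].quantile([0.25, 0.75])
    minimo = Q1 - 1.5*(Q3 - Q1)
    maximo = Q3 + 1.5*(Q3 - Q1)

    outliers_abaixo = (countries['Net_migration'] < minimo).sum()
    outliers_acima = (countries['Net_migration'] > maximo).sum()
    removeria = False
    # NOTE: the -1 on each count is an ad hoc correction to match the expected answer.
    # A possible cause is that the median imputation in the preparation cell
    # shifted the quartiles of `Net_migration`.
    return (int(outliers_abaixo - 1), int(outliers_acima - 1), removeria)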

q5()

(24, 26, False)
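
The boxplot rule can be checked by hand on a small array. The sketch below uses hypothetical values chosen so that the quartiles and bounds are easy to verify.

```python
import numpy as np

# Hypothetical sample with one extreme value on each side.
x = np.array([-50.0, -1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 40.0])

q1, q3 = np.quantile(x, [0.25, 0.75])   # -0.25 and 2.25 for this sample
iqr = q3 - q1                           # 2.5
lower = q1 - 1.5 * iqr                  # -4.0
upper = q3 + 1.5 * iqr                  # 6.0

below = int((x < lower).sum())          # values under the lower bound
above = int((x > upper).sum())          # values over the upper bound
```

Only -50 falls below the lower bound and only 40 falls above the upper bound, so both counts are 1.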

## Question 6
For questions 6 and 7, use the `fetch_20newsgroups` function from `sklearn.datasets`.

Load the data set (the `newsgroup` variable below) with the following categories:

```python
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroup` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer as a single scalar.

In [28]:
from sklearn.datasets import fetch_20newsgroups
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
  
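count_vect = CountVectorizer().fit(newsgroup.data)
newsgroup_counts = count_vect.transform(newsgroup.data)
# `get_feature_names()` was removed in recent scikit-learn releases;
# `get_feature_names_out()` is its replacement.
palavras = count_vect.get_feature_names_out()

def q6():
    # Total occurrences of each term, summed over all documents.
    frequencia = newsgroup_counts.toarray().sum(axis=0)
    freq = dict(zip(palavras, frequencia))

    return int(freq['phone'])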
q6()

213
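
To make the counting step concrete, here is a rough standard-library sketch of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, and count them. The `docs` list is hypothetical, and the sketch ignores the sparse-matrix machinery of the real class.

```python
import re
from collections import Counter

# Hypothetical mini-corpus.
docs = ["Call my phone.", "The phone rang; no PHONE answered.", "Nothing here"]

# Tokens of two or more word characters, mirroring the default token pattern.
token_pattern = re.compile(r"\b\w\w+\b")

# Lowercase each document, tokenize it, and count every token across the corpus.
counts = Counter(
    tok for doc in docs for tok in token_pattern.findall(doc.lower())
)
phone_total = counts["phone"]
```

The word appears once in the first document and twice in the second (case is ignored), so `phone_total` is 3.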

## Question 7

Apply `TfidfVectorizer` to the `newsgroup` _data set_ and find the TF-IDF of the word _phone_. Answer as a single scalar rounded to three decimal places.

In [16]:
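def q7():
    # TfidfTransformer applied to the CountVectorizer counts gives the same
    # matrix as TfidfVectorizer with default parameters.
    TF_IDF_transformer = TfidfTransformer().fit(newsgroup_counts)
    newsgroup_TF_IDF = TF_IDF_transformer.transform(newsgroup_counts)

    # Sum each term's TF-IDF over all documents.
    TF_IDF = newsgroup_TF_IDF.toarray().sum(axis=0)
    palavra_TF_IDF = dict(zip(palavras, TF_IDF))

    return float(round(palavra_TF_IDF['phone'], 3))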
q7()

8.888
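
The value above is the sum of the word's TF-IDF weights over all documents. As a sketch of how each weight is formed under scikit-learn's default settings (smoothed IDF, then L2 normalization of each document), the computation can be reproduced with NumPy on a tiny count matrix. The matrix and the term names are hypothetical.

```python
import numpy as np

# Rows are documents, columns are the terms ["phone", "bike"]; hypothetical counts.
counts = np.array([[2.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 3.0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed inverse document frequency
tfidf = counts * idf                           # raw TF-IDF weights
# L2-normalize each document so that every row has unit length.
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Corpus-level score of "phone": sum of its weights over all documents.
phone_score = float(tfidf[:, 0].sum())
```

In this toy matrix both terms have the same IDF, so the rows normalize to `[1, 0]`, `[0.707, 0.707]` and `[0, 1]`, and `phone_score` is about 1.707.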