# Challenge 6

In this challenge we practice _feature engineering_, one of the most important and labor-intensive steps in ML. We use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which covers 227 countries with information on population size, area, migration, and production sectors.

> Note: Please do not rename the answer functions.

## General _setup_

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk

In [2]:
from sklearn.preprocessing import (
    OneHotEncoder, Binarizer, KBinsDiscretizer,
    MinMaxScaler, StandardScaler, PolynomialFeatures
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_digits, fetch_20newsgroups
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

In [3]:
# Some matplotlib settings.
#%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [4]:
countries = pd.read_csv("data/countries.csv")

In [5]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


## Notes

This _data set_ still needs a few initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are stored as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra whitespace at the start and end of each string. You can use the `str.strip()` method to remove it.
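Both adjustments can also be handled at load time. The sketch below is only an illustration on a hypothetical two-row sample (the values and padding are made up to mimic the raw file); it relies on the `decimal` parameter of `pd.read_csv` to parse decimal commas and on `str.strip()` for the padded text columns.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the raw file:
# decimal commas in the numeric field and padded text fields.
raw_csv = io.StringIO(
    'Country,Region,Pop_density\n'
    '"Afghanistan ","ASIA (EX. NEAR EAST)         ","48,0"\n'
    '"Albania ","EASTERN EUROPE     ","124,6"\n'
)

# decimal="," tells the parser to read "48,0" as the float 48.0.
sample = pd.read_csv(raw_csv, decimal=",")

# Strip the padding from every text column.
for column in ["Country", "Region"]:
    sample[column] = sample[column].str.strip()

print(sample.dtypes)
print(sample)
```

Parsing at load time avoids a separate conversion loop, but it only works if every numeric column uses the same decimal separator.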

## Start your analysis here

In [6]:
# Your analysis starts here.
countries['Country'] = countries['Country'].str.strip()
countries['Region'] = countries['Region'].str.strip()
countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


In [7]:
countries.head().T

Unnamed: 0,0,1,2,3,4
Country,Afghanistan,Albania,Algeria,American Samoa,Andorra
Region,ASIA (EX. NEAR EAST),EASTERN EUROPE,NORTHERN AFRICA,OCEANIA,WESTERN EUROPE
Population,31056997,3581655,32930091,57794,71201
Area,647500,28748,2381740,199,468
Pop_density,480,1246,138,2904,1521
Coastline_ratio,000,126,004,5829,000
Net_migration,2306,-493,-039,-2071,66
Infant_mortality,16307,2152,31,927,405
GDP,700,4500,6000,8000,19000
Literacy,360,865,700,970,1000


In [8]:
variavel_float = ["Pop_density", "Coastline_ratio", "Net_migration", "Infant_mortality", "Literacy",
                  "Phones_per_1000", "Arable", "Crops", "Other", "Birthrate", "Deathrate", "Agriculture",
                  "Industry", "Service"]

for coluna in variavel_float:
    # The raw values are strings with a decimal comma; swap it for a dot, then cast to float.
    countries[coluna] = countries[coluna].replace(regex=r",", value=".")
    countries[coluna] = countries[coluna].astype(float)

## Question 1

Which regions (the `Region` variable) are present in the _data set_? Return a list of the unique regions, with leading and trailing whitespace removed (but keep punctuation: periods, hyphens, etc.), sorted alphabetically.

In [9]:
def q1():
    regioes = np.sort(countries['Region'].unique())
    return list(regioes)

q1()

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

## Question 2

After discretizing the `Pop_density` variable into 10 bins with `KBinsDiscretizer`, using `ordinal` encoding and the `quantile` strategy, how many countries lie above the 90th percentile? Answer with a single integer scalar.

In [10]:
def q2():
    discretizar = KBinsDiscretizer(n_bins=10, encode='ordinal',  strategy='quantile')
    discretizar.fit(countries[['Pop_density']])
    resposta = discretizar.transform(countries[['Pop_density']])
    resposta = sum(resposta[:, 0] == 9)
    return int(resposta)
    
q2()

23
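With 10 quantile bins and ordinal encoding, the bins are labelled 0 to 9, so "above the 90th percentile" corresponds to bin 9. The sketch below illustrates this on hypothetical data (100 evenly spaced values, not the countries data) and cross-checks the bin count against a percentile computed directly with NumPy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: 100 distinct values, so every decile holds 10 of them.
values = np.arange(1, 101, dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values)

# Bin 9 is the last of the 10 ordinal bins: the values from the 90th percentile up.
in_top_bin = int((bins[:, 0] == 9).sum())

# Cross-check against the 90th percentile computed directly.
p90 = np.percentile(values, 90)
at_or_above_p90 = int((values[:, 0] >= p90).sum())

print(in_top_bin, at_or_above_p90)
```

On real data with repeated values the two counts can differ slightly, because ties at a bin edge all land in the same bin.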

## Question 3

If we encode the `Region` and `Climate` variables using _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

In [11]:
def q3():
    regioes = np.sort(countries['Region'].unique())
    valor = len(regioes)
    clima = countries['Climate'].unique()
    valor += len(clima)
    return valor

q3()

18
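The count above adds the 11 regions to the distinct `Climate` entries returned by `unique()`, which keeps `NaN` as an entry of its own. As a rough cross-check of the idea, the sketch below runs `OneHotEncoder` on a hypothetical toy frame (not the countries data) in which the missing climate is first turned into an explicit category; the `"missing"` label is an arbitrary choice made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: 3 distinct regions, 2 distinct climates plus one missing value.
toy = pd.DataFrame({
    "Region": ["OCEANIA", "BALTICS", "NEAR EAST", "OCEANIA"],
    "Climate": ["1", "2", None, "1"],
})

# Treat the missing climate as its own category before encoding.
toy["Climate"] = toy["Climate"].fillna("missing")

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["Region", "Climate"]])

# One output column per distinct category: 3 regions + 3 climates.
print(encoded.shape[1])
```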

In [12]:
copia_country = countries.copy()

In [13]:
numeric_features = copia_country.select_dtypes(include=[np.number])

numeric_features.columns

Index(['Population', 'Area', 'Pop_density', 'Coastline_ratio', 'Net_migration',
       'Infant_mortality', 'GDP', 'Literacy', 'Phones_per_1000', 'Arable',
       'Crops', 'Other', 'Birthrate', 'Deathrate', 'Agriculture', 'Industry',
       'Service'],
      dtype='object')

In [14]:
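for variavel in numeric_features:
    # Caution: DataFrame.fillna with a scalar fills every NaN in the frame, so the
    # first iteration (the Population mean) fills all columns, Climate included,
    # and later iterations find nothing left to fill.
    copia_country.fillna(copia_country[variavel].mean(), inplace=True)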
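Because `DataFrame.fillna` with a scalar fills every missing cell in the frame, a loop like this does not impute each column with its own mean. A per-column variant might look like the sketch below, shown on a hypothetical toy frame: passing a Series of column means lets `fillna` match each mean to its column by label and leaves non-numeric columns alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "Population": [100.0, 200.0, np.nan],
    "Arable": [10.0, np.nan, 30.0],
    "Climate": ["1", None, "2"],
})

# A Series of per-column means; fillna matches it to columns by label.
column_means = toy.mean(numeric_only=True)
filled = toy.fillna(column_means)

print(filled)
```

Here `Population` is filled with 150.0 and `Arable` with 20.0, while the missing `Climate` stays missing.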

In [15]:
copia_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
Country             227 non-null object
Region              227 non-null object
Population          227 non-null int64
Area                227 non-null int64
Pop_density         227 non-null float64
Coastline_ratio     227 non-null float64
Net_migration       227 non-null float64
Infant_mortality    227 non-null float64
GDP                 227 non-null float64
Literacy            227 non-null float64
Phones_per_1000     227 non-null float64
Arable              227 non-null float64
Crops               227 non-null float64
Other               227 non-null float64
Climate             227 non-null object
Birthrate           227 non-null float64
Deathrate           227 non-null float64
Agriculture         227 non-null float64
Industry            227 non-null float64
Service             227 non-null float64
dtypes: float64(15), int64(2), object(3)
memory usage: 35.5+ KB


In [36]:
standard_scaler = StandardScaler()

In [43]:
copia = copia_country[["Arable"]]

In [41]:
standard_scaler.fit(copia_country[["Arable"]])

score_standardized = standard_scaler.transform(copia_country[["Arable"]])

score_standardized

array([[-0.09428152],
       [-0.09427819],
       [-0.09428484],
       [-0.09428232],
       [-0.09428521],
       [-0.09428514],
       [-0.09428604],
       [-0.09427927],
       [-0.09428146],
       [-0.09427951],
       [-0.09428212],
       [-0.0942836 ],
       [-0.09427975],
       [-0.09427873],
       [-0.09428574],
       [-0.09428499],
       [-0.09426292],
       [-0.09427219],
       [-0.09427504],
       [-0.09427737],
       [-0.09428498],
       [-0.09427931],
       [-0.09427859],
       [-0.09428489],
       [-0.09428505],
       [-0.09428098],
       [-0.0942858 ],
       [-0.09428345],
       [-0.09427859],
       [-0.09428583],
       [-0.09427114],
       [-0.09428067],
       [-0.09428039],
       [-0.09427299],
       [-0.09427824],
       [-0.09428127],
       [-0.09428419],
       [-0.09428244],
       [-0.09428461],
       [-0.09428489],
       [-0.09428498],
       [-0.09428505],
       [-0.09428031],
       [-0.09428514],
       [-0.09427269],
       ...

In [46]:
pipeline_transformation = num_pipeline.fit_transform(copia_country[["Arable"]])

pipeline_transformation[:10]

array([[-0.09428152],
       [-0.09427819],
       [-0.09428484],
       [-0.09428232],
       [-0.09428521],
       [-0.09428514],
       [-0.09428604],
       [-0.09427927],
       [-0.09428146],
       [-0.09427951]])

In [18]:
countries.shape

(227, 20)

In [47]:

num_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ('scale', StandardScaler())
])

In [48]:
numeric_features = copia_country.select_dtypes(include=[np.number])

numeric_features.columns

Index(['Population', 'Area', 'Pop_density', 'Coastline_ratio', 'Net_migration',
       'Infant_mortality', 'GDP', 'Literacy', 'Phones_per_1000', 'Arable',
       'Crops', 'Other', 'Birthrate', 'Deathrate', 'Agriculture', 'Industry',
       'Service'],
      dtype='object')

In [49]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
Country             227 non-null object
Region              227 non-null object
Population          227 non-null int64
Area                227 non-null int64
Pop_density         227 non-null float64
Coastline_ratio     227 non-null float64
Net_migration       224 non-null float64
Infant_mortality    224 non-null float64
GDP                 226 non-null float64
Literacy            209 non-null float64
Phones_per_1000     223 non-null float64
Arable              225 non-null float64
Crops               225 non-null float64
Other               225 non-null float64
Climate             205 non-null object
Birthrate           224 non-null float64
Deathrate           223 non-null float64
Agriculture         212 non-null float64
Industry            211 non-null float64
Service             212 non-null float64
dtypes: float64(15), int64(2), object(3)
memory usage: 35.5+ KB


In [50]:
pipeline_transformation = num_pipeline.fit_transform(countries[numeric_features.columns])

pipeline_transformation[:1]

array([[ 0.01969468,  0.02758332, -0.19984434, -0.29344342,  4.75079803,
         3.6380982 , -0.89639423, -2.49781686, -1.02749132, -0.12636082,
        -0.51886111,  0.37260169,  2.21296666,  2.2525074 ,  1.63657562,
        -0.31540576, -1.1611354 ]])

In [51]:
pipeline_transformation = num_pipeline.fit_transform(copia_country[numeric_features.columns])

pipeline_transformation[:1]

array([[ 0.01969468,  0.02758332, -0.19984434, -0.29344342, -0.1157205 ,
        -0.11568865, -0.07124277, -0.29347562, -0.13399149, -0.09428152,
        -0.09428252, -0.09427867, -0.11572005, -0.13392697, -0.26599756,
        -0.27537136, -0.26599761]])

In [24]:
type(numeric_features)

pandas.core.frame.DataFrame

## Question 4

Apply the following _pipeline_:

1. Fill missing values in the `int64` and `float64` variables with their respective medians.
2. Standardize those variables.

After fitting the _pipeline_ described above on the data (only on the variables of the specified types), apply the same fitted _pipeline_ (or `ColumnTransformer`) to the data below. What is the value of the `Arable` variable after the _pipeline_? Answer with a single float rounded to three decimal places.

In [25]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

In [26]:
teste = pd.DataFrame(test_country).T

In [27]:
teste = teste.drop(teste.columns[[0, 1]], axis=1)

In [28]:
teste.head()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-0.190325,-0.323264,-0.0442173,-0.275281,0.132559,-0.805485,1.01198,0.618918,1.00749,0.202399,-0.0436787,-0.139297,1.31636,-0.369964,-0.61493,-0.85437,0.263445,0.571242


In [29]:
pipeline_transformation = num_pipeline.fit_transform(teste)


In [30]:
pipeline_transformation

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.]])
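The all-zero result is what standardization produces when the pipeline is refit on a single row: each value becomes its own column mean, so subtracting the mean leaves zero. The usual pattern is to fit on the full data and only call `transform` on the new row. The sketch below shows that pattern with a `ColumnTransformer` on hypothetical toy data; the frame, the `sketch_pipeline` name, and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with one text column and two numeric columns.
train = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Arable": [10.0, 20.0, np.nan, 30.0],
    "Crops": [1.0, 2.0, 3.0, 4.0],
})

numeric_columns = list(train.select_dtypes(include=["int64", "float64"]).columns)

sketch_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocessor = ColumnTransformer([("num", sketch_pipeline, numeric_columns)])

# Learn medians, means and standard deviations from the training data only.
preprocessor.fit(train)

# Reuse the fitted statistics on an unseen row; fitting again here would
# centre the row on itself and return zeros.
new_row = pd.DataFrame({"Country": ["Test"], "Arable": [30.0], "Crops": [2.5]})
result = preprocessor.transform(new_row)
print(result)
```

In this toy case `Arable` is imputed to `[10, 20, 20, 30]`, so the new value 30 sits about 1.414 standard deviations above the mean, and `Crops` at 2.5 maps to 0.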

In [31]:
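def q4():
    # Fit on the full data set, then only transform the test row; refitting on a
    # single row centres it on itself and yields all zeros, as seen above.
    numeric_columns = countries.select_dtypes(include=["int64", "float64"]).columns
    num_pipeline.fit(countries[numeric_columns])
    test_df = pd.DataFrame([test_country], columns=new_column_names)
    transformed = num_pipeline.transform(test_df[numeric_columns])
    return round(float(transformed[0, numeric_columns.get_loc("Arable")]), 3)

q4()

-1.047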

## Question 5

Find the number of _outliers_ in the `Net_migration` variable according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Count separately the outliers that fall in the lower group and those in the upper group.

Should you remove the observations flagged as _outliers_ by this method from the analysis? Answer with a three-element tuple `(outliers_abaixo, outliers_acima, removeria?)` ((int, int, bool)), that is: outliers below, outliers above, and whether you would remove them.

In [32]:
def q5():
    net = countries['Net_migration']
    
    q1 = net.quantile(0.25)
    q3 = net.quantile(0.75)
    iqr = q3 - q1
    fora_intervalo = [q1 - 1.5 * iqr, q3 + 1.5 * iqr]
    
    outliers_abaixo = net[(net < fora_intervalo[0])]
    outliers_acima = net[(net > fora_intervalo[1])]
    
    return (len(outliers_abaixo), len(outliers_acima), False)

q5()

(24, 26, False)
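The boxplot rule can be checked by hand on a small example. The sketch below applies it to a hypothetical sample with one extreme value on each side; the numbers are chosen only so that the fences are easy to verify.

```python
import pandas as pd

# Hypothetical sample: mostly small values with one extreme on each side.
sample = pd.Series([-50.0, -2.0, -1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 60.0])

q1, q3 = sample.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Comparisons with NaN are False, so missing values are never counted as outliers.
below = int((sample < lower).sum())
above = int((sample > upper).sum())
print(lower, upper, below, above)
```

Here Q1 is -1 and Q3 is 2, so the fences are -5.5 and 6.5 and one value falls outside on each side. In the countries data the rule flags 24 + 26 = 50 observations, roughly a fifth of the data set, which is a plausible reason for not discarding them wholesale.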

In [33]:
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

## Question 6

For questions 6 and 7, use the `fetch_20newsgroups` function from `sklearn.datasets`.

Load the following categories into the `newsgroup` data set:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroup` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.

In [34]:
def q6():
    count_vectorizer = CountVectorizer()
    newsgroups_counts = count_vectorizer.fit_transform(newsgroup.data)
    # vocabulary_ maps each token to its column index in the count matrix.
    phone_idx = count_vectorizer.vocabulary_["phone"]

    # Sum the column to get the total number of occurrences in the corpus.
    return int(newsgroups_counts[:, phone_idx].sum())
    
q6()

213
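To see what the count represents, the sketch below applies `CountVectorizer` to a hypothetical three-document corpus and sums the column for _phone_, looked up through the fitted `vocabulary_` mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus.
corpus = [
    "Call me on the phone",
    "The phone rang and the phone rang again",
    "No calls today",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index in the count matrix.
phone_column = vectorizer.vocabulary_["phone"]
total = int(counts[:, phone_column].sum())
print(total)
```

The word appears once in the first document and twice in the second, so the column sums to 3. Note that only the exact token is counted; related tokens such as _phones_ occupy separate columns.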


## Question 7

Apply `TfidfVectorizer` to the `newsgroup` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

In [35]:
def q7():
    tfidf_vectorizer = TfidfVectorizer()
    newsgroups_tfidf_vectorized = tfidf_vectorizer.fit_transform(newsgroup.data)
    # Look the word up in this vectorizer's own vocabulary.
    phone_idx = tfidf_vectorizer.vocabulary_["phone"]

    # Sum the per-document TF-IDF scores of the word over the corpus.
    return float(round(newsgroups_tfidf_vectorized[:, phone_idx].sum(), 3))

q7()

8.888
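The reported value is the sum of the word's TF-IDF scores over all documents. The sketch below reproduces that on a hypothetical three-document corpus, assuming the default `TfidfVectorizer` settings (smoothed IDF and L2-normalized rows).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus.
corpus = ["phone", "phone call", "call back later"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Per-document TF-IDF scores for the word, then their sum over the corpus.
phone_column = vectorizer.vocabulary_["phone"]
phone_scores = tfidf[:, phone_column].toarray().ravel()
phone_total = float(phone_scores.sum())
print(phone_scores.round(3), round(phone_total, 3))
```

With L2 normalization, the single-word document scores 1.0, and the document _phone call_ splits its weight evenly between two words with the same IDF (about 0.707 each), so the total is about 1.707.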