# Challenge 6

In this challenge, we will practice _feature engineering_, one of the most important and labor-intensive processes in ML. We will use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which contains data on 227 countries, including population size, area, immigration, and production sectors.

> Note: Please do not rename the answer functions.

## General _setup_

In [90]:
import pandas as pd
import numpy as np
import seaborn as sns
#import sklearn as sk
from sklearn.preprocessing import (
    OneHotEncoder, Binarizer, KBinsDiscretizer,
    MinMaxScaler, StandardScaler, PolynomialFeatures
)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_digits, fetch_20newsgroups
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

In [2]:
# Some settings for matplotlib.
%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [3]:
countries = pd.read_csv("countries.csv")

In [4]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


## Notes

This _data set_ still needs some initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra whitespace at the beginning and end of each string. You can use the `str.strip()` method to remove it.
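
As a minimal sketch of both adjustments, the snippet below works on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the data set):

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw file: padded strings, decimal commas.
toy = pd.DataFrame({
    "Country": ["Brazil ", " Chile "],
    "Pop_density": ["22,0", "21,5"],
})

# Strip leading and trailing whitespace from the text column.
toy["Country"] = toy["Country"].str.strip()

# Swap the decimal comma for a point, then cast the strings to float.
toy["Pop_density"] = (
    toy["Pop_density"].str.replace(",", ".", regex=False).astype(float)
)

print(toy.dtypes)
```

Another option that may work here is to let pandas parse the decimal commas at load time with the `decimal=","` argument of `pd.read_csv`.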

## Start your analysis here

In [5]:
# Your analysis starts here.

countries.shape

(227, 20)

*Checking the variable types, the number of missing values, and the number of distinct observations*

In [6]:
aux = pd.DataFrame( data = { "tipos" : countries.dtypes, 
                            "n_ausentes" : countries.isna().sum(),
                            "%ausentes": ((countries.isna().sum()/countries.shape[0])*100),
                            "n_distintas" : countries.nunique() } )

In [7]:
aux

Unnamed: 0,tipos,n_ausentes,%ausentes,n_distintas
Country,object,0,0.0,227
Region,object,0,0.0,11
Population,int64,0,0.0,227
Area,int64,0,0.0,226
Pop_density,object,0,0.0,219
Coastline_ratio,object,0,0.0,151
Net_migration,object,3,1.321586,157
Infant_mortality,object,3,1.321586,220
GDP,float64,1,0.440529,130
Literacy,object,18,7.929515,140


_Removing leading and trailing whitespace from the observations of the **Region** and **Country** variables_

In [8]:
countries["Region"]

0            ASIA (EX. NEAR EAST)         
1      EASTERN EUROPE                     
2      NORTHERN AFRICA                    
3      OCEANIA                            
4      WESTERN EUROPE                     
                      ...                 
222    NEAR EAST                          
223    NORTHERN AFRICA                    
224    NEAR EAST                          
225    SUB-SAHARAN AFRICA                 
226    SUB-SAHARAN AFRICA                 
Name: Region, Length: 227, dtype: object

In [9]:
countries["Region"].str.strip()

0      ASIA (EX. NEAR EAST)
1            EASTERN EUROPE
2           NORTHERN AFRICA
3                   OCEANIA
4            WESTERN EUROPE
               ...         
222               NEAR EAST
223         NORTHERN AFRICA
224               NEAR EAST
225      SUB-SAHARAN AFRICA
226      SUB-SAHARAN AFRICA
Name: Region, Length: 227, dtype: object

In [10]:
regioes = countries["Region"].str.strip()

regioes = regioes.unique()

regioes

array(['ASIA (EX. NEAR EAST)', 'EASTERN EUROPE', 'NORTHERN AFRICA',
       'OCEANIA', 'WESTERN EUROPE', 'SUB-SAHARAN AFRICA',
       'LATIN AMER. & CARIB', 'C.W. OF IND. STATES', 'NEAR EAST',
       'NORTHERN AMERICA', 'BALTICS'], dtype=object)

In [11]:
regioes = list(regioes)
regioes

['ASIA (EX. NEAR EAST)',
 'EASTERN EUROPE',
 'NORTHERN AFRICA',
 'OCEANIA',
 'WESTERN EUROPE',
 'SUB-SAHARAN AFRICA',
 'LATIN AMER. & CARIB',
 'C.W. OF IND. STATES',
 'NEAR EAST',
 'NORTHERN AMERICA',
 'BALTICS']

In [12]:
regioes.sort()

regioes

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

In [13]:
countries["Country"]

0         Afghanistan 
1             Albania 
2             Algeria 
3      American Samoa 
4             Andorra 
            ...       
222         West Bank 
223    Western Sahara 
224             Yemen 
225            Zambia 
226          Zimbabwe 
Name: Country, Length: 227, dtype: object

In [14]:
countries["Country"] = countries["Country"].str.strip()
countries["Country"]

0         Afghanistan
1             Albania
2             Algeria
3      American Samoa
4             Andorra
            ...      
222         West Bank
223    Western Sahara
224             Yemen
225            Zambia
226          Zimbabwe
Name: Country, Length: 227, dtype: object

__Obtaining class intervals for the variable__ *Pop_density*

*First we replace the decimal comma with a point, then convert the object-typed variables to numeric*

In [15]:
variaveis = [ "Pop_density", "Coastline_ratio",
        "Net_migration", "Infant_mortality", "Literacy", "Phones_per_1000",
        "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
        "Industry", "Service"]

In [16]:
countries[variaveis] = countries[variaveis].apply(lambda x: x.str.replace(',','.').astype(float))

In [17]:
countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,,,


In [18]:
#countries["Pop_density"] = countries["Pop_density"].apply(lambda x: str(x).replace(",","."))
#countries["Pop_density"]

In [19]:
#countries["Pop_density"] = countries["Pop_density"].astype("float64")
#countries["Pop_density"]

*Computing the classes and the class intervals*

In [20]:
intervalos = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile").fit(countries[ ["Pop_density"] ])

In [21]:
intervalos

KBinsDiscretizer(encode='ordinal', n_bins=10)

*Interval limits*

In [22]:
intervalos.bin_edges_

array([array([0.00000e+00, 1.01400e+01, 2.12200e+01, 3.94800e+01, 5.98000e+01,
       7.88000e+01, 1.05540e+02, 1.50120e+02, 2.53700e+02, 3.96740e+02,
       1.62715e+04])], dtype=object)

*Class categories*

In [23]:
classes = intervalos.transform(countries[ ["Pop_density"] ])
classes

array([[3.],
       [6.],
       [1.],
       [8.],
       [7.],
       [0.],
       [6.],
       [7.],
       [1.],
       [5.],
       [8.],
       [0.],
       [5.],
       [5.],
       [2.],
       [9.],
       [9.],
       [9.],
       [3.],
       [8.],
       [1.],
       [4.],
       [9.],
       [3.],
       [0.],
       [5.],
       [0.],
       [2.],
       [7.],
       [4.],
       [4.],
       [3.],
       [4.],
       [8.],
       [4.],
       [2.],
       [0.],
       [5.],
       [7.],
       [0.],
       [0.],
       [2.],
       [6.],
       [2.],
       [8.],
       [2.],
       [1.],
       [5.],
       [5.],
       [3.],
       [5.],
       [5.],
       [5.],
       [6.],
       [6.],
       [1.],
       [5.],
       [7.],
       [4.],
       [3.],
       [5.],
       [8.],
       [1.],
       [3.],
       [2.],
       [4.],
       [2.],
       [3.],
       [1.],
       [6.],
       [0.],
       [4.],
       [0.],
       [6.],
       [9.],
       [4.],
       [7.],

*Number of countries in category 9, that is, the number of countries above the 90th percentile*

In [24]:
int((classes == 9).sum())

23

*Another way to obtain the number of countries above the 90th percentile*

In [25]:
percentil90 = countries["Pop_density"].quantile(q=0.90)
sum(countries["Pop_density"]>percentil90)

23

*Encoding the variables __Region__ and __Climate__*

In [26]:
# Work on an explicit copy so later assignments do not target a slice of countries.
aux = countries[["Region", "Climate"]].copy()

In [27]:
aux["Climate"].unique()

array([1. , 3. , 2. , nan, 4. , 1.5, 2.5])

In [28]:
aux.isna().sum()

Region      0
Climate    22
dtype: int64

In [29]:
aux.dtypes

Region      object
Climate    float64
dtype: object

In [30]:
aux.nunique()

Region     11
Climate     6
dtype: int64

In [31]:
aux.loc[aux["Climate"].isna(),"Climate"] = "999"



In [32]:
aux["Climate"].unique()

array([1.0, 3.0, 2.0, '999', 4.0, 1.5, 2.5], dtype=object)

In [33]:
aux["Climate"] = aux["Climate"].astype("str")



*Checking the number of new categories created from the variables __Region__ and __Climate__*

In [34]:
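# np.int was removed from NumPy, so use the built-in int; sparse_output
# replaces the older sparse argument (scikit-learn >= 1.2).
codificacao = OneHotEncoder(sparse_output=False, dtype=int)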

In [35]:
codificacao.fit_transform(aux[["Region"]])

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0]])

In [36]:
codificacao1 = codificacao.fit_transform(aux[["Region"]])
codificacao1.shape

(227, 11)

In [37]:
len( countries["Climate"].unique() )

7

In [38]:
codificacao.fit_transform(aux[["Climate"]])

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [39]:
codificacao2 = codificacao.fit_transform(aux[["Climate"]])
codificacao2.shape

(227, 7)

*Total number of categories created*

In [40]:
codificacao1.shape[1]+codificacao2.shape[1]

18

*Another way to do it*

In [41]:
int( countries["Region"].nunique() +  len( countries["Climate"].unique() ))

18

*Checking the data once more*

In [42]:
countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,,,


*Checking the data types once more*

In [43]:
countries.dtypes

Country              object
Region               object
Population            int64
Area                  int64
Pop_density         float64
Coastline_ratio     float64
Net_migration       float64
Infant_mortality    float64
GDP                 float64
Literacy            float64
Phones_per_1000     float64
Arable              float64
Crops               float64
Other               float64
Climate             float64
Birthrate           float64
Deathrate           float64
Agriculture         float64
Industry            float64
Service             float64
dtype: object

In [44]:
countries.isna().sum()

Country              0
Region               0
Population           0
Area                 0
Pop_density          0
Coastline_ratio      0
Net_migration        3
Infant_mortality     3
GDP                  1
Literacy            18
Phones_per_1000      4
Arable               2
Crops                2
Other                2
Climate             22
Birthrate            3
Deathrate            4
Agriculture         15
Industry            16
Service             15
dtype: int64

In [45]:
countries.columns

Index(['Country', 'Region', 'Population', 'Area', 'Pop_density',
       'Coastline_ratio', 'Net_migration', 'Infant_mortality', 'GDP',
       'Literacy', 'Phones_per_1000', 'Arable', 'Crops', 'Other', 'Climate',
       'Birthrate', 'Deathrate', 'Agriculture', 'Industry', 'Service'],
      dtype='object')

In [46]:
variaveis = [ "Population", "Area", "Pop_density", "Coastline_ratio",
        "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
        "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
        "Industry", "Service"]

In [47]:
df_aux = countries[variaveis]

In [48]:
preprocessamento = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="median")),
    ('scaler', StandardScaler()) ] )

In [49]:
preprocessamento.fit(df_aux)

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [50]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

In [51]:
test_country

['Test Country',
 'NEAR EAST',
 -0.19032480757326514,
 -0.3232636124824411,
 -0.04421734470810142,
 -0.27528113360605316,
 0.13255850810281325,
 -0.8054845935643491,
 1.0119784924248225,
 0.6189182532646624,
 1.0074863283776458,
 0.20239896852403538,
 -0.043678728558593366,
 -0.13929748680369286,
 1.3163604645710438,
 -0.3699637766938669,
 -0.6149300604558857,
 -0.854369594993175,
 0.263445277972641,
 0.5712416961268142]

In [52]:
transf_prepro = preprocessamento.transform([test_country[2:]])

In [53]:
transf_prepro

array([[-0.24432501, -0.33489095, -0.22884735, -0.29726002,  0.01959086,
        -1.02861728, -0.96623348, -4.35427242, -1.03720972, -1.04685743,
        -0.55058149, -5.10112169, -1.21812201, -2.02455164, -1.99092137,
        -7.04915046, -0.13915481,  0.03490335]])

In [54]:
transf_prepro[0][9].round(3)

-1.047

*Detecting extreme values*

In [55]:
countries["Net_migration"].isna().sum()

3

In [56]:
countries["Net_migration"]

0      23.06
1      -4.93
2      -0.39
3     -20.71
4       6.60
       ...  
222     2.98
223      NaN
224     0.00
225     0.00
226     0.00
Name: Net_migration, Length: 227, dtype: float64

In [57]:
#countries["Net_migration"].fillna(value=countries["Net_migration"].median(), inplace=True )

In [58]:
variavel = countries["Net_migration"].dropna()

In [59]:
variavel.isna().sum()

0

*Obtaining the first and third quartiles and computing the interquartile range*

In [60]:
quantil3 = countries["Net_migration"].quantile(0.75)
quantil1 = countries["Net_migration"].quantile(0.25)
dif_quantis = quantil3 - quantil1

In [61]:
dif_quantis

1.9249999999999998

*Computing the reference range*

In [62]:
faixa = [quantil1 - 1.5*dif_quantis, quantil3 + 1.5*dif_quantis]
faixa

[-3.8149999999999995, 3.885]

In [63]:
countries["Net_migration"]

0      23.06
1      -4.93
2      -0.39
3     -20.71
4       6.60
       ...  
222     2.98
223      NaN
224     0.00
225     0.00
226     0.00
Name: Net_migration, Length: 227, dtype: float64

*Number of lower extreme values*

In [64]:
sum( (countries["Net_migration"] < faixa[0]) )

24

*Number of upper extreme values*

In [65]:
sum( (countries["Net_migration"] > faixa[1]) )

26

*Total number of extreme values*

In [66]:
sum( (countries["Net_migration"] < faixa[0]) | (countries["Net_migration"] > faixa[1]) )

50

*Extreme values*

In [67]:
countries["Net_migration"].loc[ (countries["Net_migration"] < faixa[0]) | (countries["Net_migration"] > faixa[1]) ]

0      23.06
1      -4.93
3     -20.71
4       6.60
6      10.76
7      -6.15
9      -6.47
11      3.98
13     -4.90
28     10.01
30     -4.58
36      5.96
37    -12.07
38     18.75
56    -13.87
59     -8.58
70      6.27
75     -4.70
80     -8.37
81    -13.92
91      5.24
98      4.99
99      5.36
102    -4.92
105     6.59
111    14.18
119     4.85
121     8.97
122     4.86
130    -6.04
134     6.78
135    -4.87
136   -20.99
138     7.75
149     4.05
153     9.61
166    16.29
172    -7.11
174    -4.86
175    -7.64
176   -11.70
177    10.98
182    -5.69
184    11.53
188     5.37
193    -8.81
196     4.05
204   -10.83
208    11.68
220    -8.94
Name: Net_migration, dtype: float64

In [83]:
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

In [85]:
len(newsgroup.data)

1773

In [84]:
len(newsgroup)

5

In [88]:
newsgroup

{'data': ['From: rubin@cis.ohio-state.edu (Daniel J Rubin)\nSubject: Re: what to do with old 256k SIMMs?\nOrganization: The Ohio State University Dept. of Computer and Info. Science\nLines: 18\nNNTP-Posting-Host: diplodocus.cis.ohio-state.edu\n\n>>\tI was wondering if people had any good uses for old\n>>256k SIMMs.  I have a bunch of them for the Apple Mac\n>>and I know lots of other people do to.  I have tried to\n>>sell them but have gotten NO interest.\n\nHow hard would it be to somehow interface them to some of the popular \nMotorola microcontrollers.  I am a novice at microcontrollers, but I am\nstarting to get into them for some of my projects.  I have several 256k\nSIMMs laying around from upgraded Macs and if I could use them as "free"\nmemory in one or two of my projects that would be great.  One project that\ncomes to mind is a Caller ID device that would require quite a bit of RAM\nto store several hundered CID records etc...\n\n                                              

In [91]:
contar = CountVectorizer()

In [94]:
contar_newsgroups = contar.fit_transform(newsgroup.data)

In [97]:
id_palavra = sorted( [contar.vocabulary_.get("phone") ] )

In [98]:
id_palavra

[19211]

In [103]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
df = pd.DataFrame( contar_newsgroups[:, id_palavra].toarray(),
                  columns = np.array( contar.get_feature_names_out() )[id_palavra] )

In [104]:
df

Unnamed: 0,phone
0,0
1,0
2,0
3,0
4,0
...,...
1768,0
1769,0
1770,0
1771,0


In [105]:
df.sum()

phone    213
dtype: int64

In [106]:
float(df["phone"].sum())

213.0

In [108]:
tf_idf = TfidfTransformer()

In [109]:
tf_idf.fit(contar_newsgroups)

TfidfTransformer()

In [111]:
tf_idf_newsgroups = tf_idf.transform(contar_newsgroups)

In [112]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
df_tf_idf = pd.DataFrame( tf_idf_newsgroups[:, id_palavra].toarray(),
                  columns = np.array( contar.get_feature_names_out() )[id_palavra] )

In [113]:
df_tf_idf

Unnamed: 0,phone
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
1768,0.0
1769,0.0
1770,0.0
1771,0.0


In [114]:
df_tf_idf.sum()

phone    8.887746
dtype: float64

In [117]:
round(float(df_tf_idf["phone"].sum()), 3)

8.888

## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions in the _data set_, with leading and trailing whitespace removed from each string (but keep punctuation: periods, hyphens, etc.), sorted alphabetically.

In [68]:
def q1():
    # Return the answer to question 1 here.
    regioes = countries["Region"].str.strip().unique()
    regioes = list(regioes)
    regioes.sort()
    return regioes

## Question 2

Discretizing the variable `Pop_density` into 10 intervals with `KBinsDiscretizer`, using the `ordinal` encoding and the `quantile` strategy, how many countries are above the 90th percentile? Answer with a single integer scalar.

In [69]:
def q2():
    # Return the answer to question 2 here.
    # Idempotent conversion: swap the decimal comma for a point, then cast to float.
    countries["Pop_density"] = countries["Pop_density"].apply(lambda x: str(x).replace(",","."))
    countries["Pop_density"] = countries["Pop_density"].astype("float64")
    intervalos = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile").fit(countries[ ["Pop_density"] ])
    classes = intervalos.transform(countries[ ["Pop_density"] ])
    # Bin 9 is the last of the ten quantile bins, i.e. the values above the 90th percentile.
    return int((classes == 9).sum())
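
To see what the top bin contains, here is a small sketch on synthetic data (the integers 1 to 100, an assumption made purely for illustration). With ten quantile bins, bin 9 should hold roughly the top tenth of the values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data (not from the data set): the integers 1..100 as one column.
values = np.arange(1, 101, dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values)

# Bin 9 is the last of the ten quantile bins, i.e. the top decile.
n_top = int((bins == 9).sum())
print(n_top)
```

Calling `.sum()` on the boolean array and converting the result with `int` avoids relying on the conversion of a one-element array to a scalar.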

## Question 3

If we encode the variables `Region` and `Climate` using _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

In [70]:
def q3():
    # Return the answer to question 3 here.
    #aux = countries[["Region", "Climate"]].copy()
    #aux.loc[aux["Climate"].isna(),"Climate"] = "999"
    #aux["Climate"] = aux["Climate"].astype("str")
    #codificacao = OneHotEncoder(sparse_output=False, dtype=int)
    #codificacao1 = codificacao.fit_transform(aux[["Region"]])
    #codificacao2 = codificacao.fit_transform(aux[["Climate"]])
    #return codificacao1.shape[1]+codificacao2.shape[1]
    # Alternatively, count the categories directly; unique() keeps NaN,
    # so missing Climate values count as a category of their own.
    return int(countries["Region"].nunique() + len(countries["Climate"].unique()))
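
The counting shortcut can be cross-checked against an actual encoding. The sketch below uses `pd.get_dummies` on a hypothetical toy frame (the values are invented for illustration), treating a missing `Climate` as its own category, as the analysis above does:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 regions, 2 observed climates plus a missing value.
toy = pd.DataFrame({
    "Region": ["OCEANIA", "BALTICS", "NEAR EAST", "OCEANIA"],
    "Climate": [1.0, 2.0, np.nan, 1.0],
})

# One indicator column per region (no missing values in this column).
region_dummies = pd.get_dummies(toy["Region"])

# dummy_na=True adds an extra indicator column for NaN.
climate_dummies = pd.get_dummies(toy["Climate"], dummy_na=True)

n_new = region_dummies.shape[1] + climate_dummies.shape[1]

# Same count without encoding: nunique() ignores NaN, unique() keeps it.
n_check = toy["Region"].nunique() + len(toy["Climate"].unique())
print(n_new, n_check)
```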

## Question 4

Apply the following _pipeline_:

1. Fill the variables of type `int64` and `float64` with their respective medians.
2. Standardize these variables.

After applying the _pipeline_ described above to the data (only to the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the data point below. What is the value of the variable `Arable` after the _pipeline_? Answer with a single float rounded to three decimal places.

In [71]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

In [72]:
def q4():
    # Return the answer to question 4 here.
    variaveis = [ "Population", "Area", "Pop_density", "Coastline_ratio",
        "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
        "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
        "Industry", "Service"]
    #countries[variaveis] = countries[variaveis].apply(lambda x: x.str.replace(',','.').astype(float))
    preprocessamento = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="median")), ('scaler', StandardScaler()) ] )
    preprocessamento.fit(countries[variaveis])
    # Skip the first two entries (Country and Region), which are not numeric.
    transf_prepro = preprocessamento.transform([test_country[2:]])
    # Look up the position of Arable by name instead of hard-coding index 9.
    return float(transf_prepro[0][variaveis.index("Arable")].round(3))
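
The question also mentions `ColumnTransformer`. A minimal sketch of that route, on a hypothetical toy frame rather than the real data, selects the numeric columns by dtype so the list of names does not have to be typed by hand:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one text column and one numeric column.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Arable": [0.0, 2.0, np.nan, 4.0],
})

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Apply the pipeline only to numeric columns; other columns are dropped by default.
transformer = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, make_column_selector(dtype_include=np.number)),
    ]
)
transformer.fit(toy)

# A new row passed as a DataFrame must carry the same column names.
new_row = pd.DataFrame({"Country": ["Test"], "Arable": [4.0]})
scaled = float(transformer.transform(new_row)[0, 0])
print(round(scaled, 3))
```

Here the missing value is imputed with the median 2.0, so the column has mean 2 and population standard deviation √2, and the new value 4.0 maps to (4 − 2)/√2.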

## Question 5

Find the number of _outliers_ of the variable `Net_migration` according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

counting separately those in the lower group and those in the upper group.

Should you remove from the analysis the observations flagged as _outliers_ by this method? Answer with a three-element tuple `(outliers_abaixo, outliers_acima, removeria?)` ((int, int, bool)).

In [80]:
def q5():
    # Return the answer to question 5 here.
    quantil1, quantil3 = countries["Net_migration"].quantile(q=[0.25, 0.75])
    #quantil1 = countries["Net_migration"].quantile(0.25)
    dif_quantis = quantil3 - quantil1
    faixa = [quantil1 - 1.5*dif_quantis, quantil3 + 1.5*dif_quantis]
    # Comparisons with NaN are False, so missing values are never counted.
    resp1 = int( (countries["Net_migration"] < faixa[0]).sum() )
    resp2 = int( (countries["Net_migration"] > faixa[1]).sum() )
    return (resp1, resp2, False)
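
The boxplot rule can be traced by hand on a small sample. The series below is hypothetical (chosen so the quartiles are easy to verify) and includes a missing value to show that it is ignored:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one low and one high extreme, plus a missing value.
sample = pd.Series([-100, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100, np.nan])

q1, q3 = sample.quantile([0.25, 0.75])  # NaN is skipped by default
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Comparisons with NaN evaluate to False, so NaN is never flagged.
n_below = int((sample < lower).sum())
n_above = int((sample > upper).sum())
print((n_below, n_above))
```

With Q1 = 2.5 and Q3 = 7.5, the IQR is 5 and the reference range is [-5, 15], so only -100 and 100 fall outside it.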

## Question 6
For questions 6 and 7, use the `fetch_20newsgroups` function from the test data sets of `sklearn`.

Load the following categories and the `newsgroups` data set:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroups` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.

In [107]:
def q6():
    # Return the answer to question 6 here.
    categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
    newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
    contar = CountVectorizer()
    contar_newsgroups = contar.fit_transform(newsgroup.data)
    # vocabulary_ maps each token to its column index in the count matrix.
    id_palavra = sorted( [contar.vocabulary_.get("phone") ] )
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    df = pd.DataFrame( contar_newsgroups[:, id_palavra].toarray(),
                  columns = np.array( contar.get_feature_names_out() )[id_palavra] )
    return float(df["phone"].sum())
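
The same lookup can be followed on a corpus small enough to count by hand. The three documents below are hypothetical; with the default settings, `CountVectorizer` lowercases the text, so "Phone" and "phone" are counted together:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus.
corpus = [
    "My phone rang twice",
    "Phone the phone company",
    "No calls today",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index in the count matrix.
column = vectorizer.vocabulary_["phone"]

# Summing that column gives the total occurrences across all documents.
total = int(counts[:, column].sum())
print(total)
```

Summing the sparse column directly avoids building an intermediate `DataFrame` when only the total is needed.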

## Question 7

Apply `TfidfVectorizer` to the `newsgroups` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

In [118]:
def q7():
    # Return the answer to question 7 here.
    categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
    newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
    # CountVectorizer followed by TfidfTransformer is equivalent to TfidfVectorizer.
    contar = CountVectorizer()
    contar_newsgroups = contar.fit_transform(newsgroup.data)
    id_palavra = sorted( [contar.vocabulary_.get("phone") ] )
    tf_idf = TfidfTransformer()
    tf_idf.fit(contar_newsgroups)
    tf_idf_newsgroups = tf_idf.transform(contar_newsgroups)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    df_tf_idf = pd.DataFrame( tf_idf_newsgroups[:, id_palavra].toarray(),
                  columns = np.array( contar.get_feature_names_out() )[id_palavra] )
    return round(float(df_tf_idf["phone"].sum()), 3)
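
The answer above takes the two-step route (`CountVectorizer` followed by `TfidfTransformer`), while the question names `TfidfVectorizer`. With default parameters the two should give the same matrix; the sketch below checks this on a hypothetical toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

# Hypothetical three-document corpus.
corpus = [
    "My phone rang twice",
    "Phone the phone company",
    "No calls today",
]

# Two-step route: raw counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with default parameters.
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())

# Sum of the TF-IDF weights of "phone" over all documents.
phone_tfidf = float(one_step[:, vectorizer.vocabulary_["phone"]].sum())
print(same, round(phone_tfidf, 3))
```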