# Regression activity

In this activity, you will implement two regression models on the [salary data](https://www.kaggle.com/wenruliu/adult-income-dataset). The dataset has 15 variables covering education, age, sex, and other attributes. **Your goal is to predict how many hours a person works per week** (`hours-per-week`) from their characteristics.

To complete the activity, follow these steps:

* Split the dataset into training and test sets in an 80%/20% proportion, respectively.
* Use dummy variables to represent the categorical variables.
* Train two of the regression models covered so far on the training set.
* Evaluate the models on the test set.
* From the predictions, compute R² and RMSE.
* Analyze the results for both models and state which one predicted the results better.






In [0]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

trabalhadores = pd.read_csv("https://orionwinter.github.io/datasets/adult.csv")

trabalhadores.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [0]:
trabalhadores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          48842 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         48842 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-null int64
native-country     48842 non-null object
income             48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [0]:
# Standardize features to zero mean and unit variance.
# Note: every call fits a new scaler on the data passed in.
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)

In [0]:
#trabalhadores = trabalhadores.query("workclass != '?'")
#trabalhadores

In [0]:
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['workclass'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['education'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['marital-status'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['occupation'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['relationship'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['race'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['gender'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['native-country'], drop_first=True)
#trabalhadores = pd.get_dummies(trabalhadores , columns = ['income'], drop_first=True)

In [0]:
trabalhadores = pd.get_dummies(trabalhadores , columns = ['workclass','education','marital-status','occupation',
                                                          'relationship','race','gender','native-country','income'], drop_first=True)
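
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset). A categorical column with k levels becomes k-1 indicator columns, and the dropped level acts as the reference category, which avoids perfectly collinear columns in a linear model with an intercept. Since levels are sorted before the first one is dropped, a placeholder such as `?` would likely end up as the reference level for the columns that contain it.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "gender": ["Male", "Female", "Male"],
    "workclass": ["Private", "Local-gov", "Private"],
})

encoded = pd.get_dummies(toy, columns=["gender", "workclass"], drop_first=True)

# Levels are sorted, so "Female" and "Local-gov" are dropped as reference levels
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns may be booleans rather than 0/1 integers; both work as inputs to `LinearRegression`.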


In [0]:
trabalhadores = trabalhadores.drop(columns=['capital-gain','capital-loss'])

In [0]:
trabalhadores.head()

Unnamed: 0,age,fnlwgt,educational-num,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_11th,education_12th,education_1st-4th,education_5th-6th,education_7th-8th,education_9th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,occupation_Adm-clerical,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,...,native-country_China,native-country_Columbia,native-country_Cuba,native-country_Dominican-Republic,native-country_Ecuador,native-country_El-Salvador,native-country_England,native-country_France,native-country_Germany,native-country_Greece,native-country_Guatemala,native-country_Haiti,native-country_Holand-Netherlands,native-country_Honduras,native-country_Hong,native-country_Hungary,native-country_India,native-country_Iran,native-country_Ireland,native-country_Italy,native-country_Jamaica,native-country_Japan,native-country_Laos,native-country_Mexico,native-country_Nicaragua,native-country_Outlying-US(Guam-USVI-etc),native-country_Peru,native-country_Philippines,native-country_Poland,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,income_>50K
0,25,226802,7,40,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,38,89814,9,50,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,28,336951,12,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
3,44,160323,10,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,18,103497,10,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [0]:
# Features: every column except the target
X = trabalhadores.drop(columns=['hours-per-week']).values
# Target: weekly hours, as a column vector
y = trabalhadores['hours-per-week'].values.reshape(-1,1)

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
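
As a quick sanity check of the 80%/20% proportion, the sketch below runs the same `train_test_split` call on small hypothetical arrays (`X_demo` and `y_demo` are made up for illustration) and prints the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 3 features
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 3) (2, 3)
```

Fixing `random_state` makes the split reproducible, so both models are compared on exactly the same test rows.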

In [0]:
X_test

array([[    56,  33115,      9, ...,      0,      0,      0],
       [    25, 112847,      9, ...,      0,      0,      0],
       [    43, 170525,     13, ...,      0,      0,      1],
       ...,
       [    25, 167835,     13, ...,      0,      0,      1],
       [    18, 170194,      7, ...,      0,      0,      0],
       [    52, 176240,     13, ...,      0,      0,      1]])

In [0]:
#X_train = feature_scaling(X_train)
#X_test = feature_scaling(X_test)

#X = feature_scaling(X)
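
The commented-out calls above would fit a separate scaler on the training and test sets, so the two sets would be standardized with different means and standard deviations. A common alternative, sketched here with hypothetical arrays standing in for `X_train` and `X_test`, is to fit the scaler on the training data only and reuse it for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical arrays standing in for X_train and X_test
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test_demo)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # approximately zero in each column
print(X_test_scaled)
```

Ordinary least squares predictions do not change under feature scaling, so this step matters mainly for regularized models or for numerical stability.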


In [0]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#regressor.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
# Fit the degree-2 polynomial expansion on the training data only,
# then apply the same transformation to the test data
poly_reg = PolynomialFeatures(degree = 2)
X_poly_train = poly_reg.fit_transform(X_train)
X_poly_test = poly_reg.transform(X_test)

poly_reg


PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
                   order='C')
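
To illustrate what the degree-2 expansion produces, here is a minimal sketch on hypothetical two-feature arrays. The transformer is fitted once on the training rows and then reused for the test rows. With n input features and the bias column included, the expansion has (n+1)(n+2)/2 columns; for example, n = 100 inputs would give 5,151 columns, which may partly explain why a polynomial model on many dummy variables generalizes poorly.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with two features, a and b
X_tr_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
X_te_demo = np.array([[5.0, 6.0]])

poly = PolynomialFeatures(degree=2)
X_tr_poly = poly.fit_transform(X_tr_demo)  # columns: 1, a, b, a^2, a*b, b^2
X_te_poly = poly.transform(X_te_demo)      # same expansion applied to the test rows

print(poly.n_output_features_)  # 6
print(X_te_poly)                # values 1, 5, 6, 25, 30, 36
```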

In [0]:
lin_reg_poly = LinearRegression()
#lin_reg_poly.fit(X_poly_train, y)
lin_reg_poly.fit(X_poly_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
df_results = pd.DataFrame(columns=['Name','r_2_score', 'rmse'])

In [0]:
from math import sqrt
# Multiple regression
#y_pred = regressor.predict(X)
y_pred = regressor.predict(X_test)
#y_pred = regressor.predict(X_poly_test)


df_results.loc[len(df_results), :] = ['Reg. Múltipla', regressor.score(X_test, y_test), sqrt(mean_squared_error(y_test, y_pred))]
#df_results.loc[len(df_results), :] = ['Reg. Múltipla', regressor.score(X, y), sqrt(mean_squared_error(y, y_pred))]

# Polynomial regression
y_pred = lin_reg_poly.predict(X_poly_test)
#y_pred = lin_reg_poly.predict(X_poly_train)


df_results.loc[len(df_results), :] = ['Reg Polinomial', lin_reg_poly.score(X_poly_test, y_test), sqrt(mean_squared_error(y_test, y_pred))]
#df_results.loc[len(df_results), :] = ['Reg Polinomial', lin_reg_poly.score(X_poly_train, y), sqrt(mean_squared_error(y, y_pred))]

In [0]:
df_results

Unnamed: 0,Name,r_2_score,rmse
0,Reg. Múltipla,0.191114,10.9023
1,Reg Polinomial,0.0290162,11.9449


According to the test results above, multiple regression is the better model here: it has a higher R² (0.191 vs. 0.029) and a lower RMSE (10.90 vs. 11.94). Even so, both models predict weekly hours poorly. The R² values are low, so neither model explains more than about 19% of the variance in `hours-per-week` on the test set.
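
For reference, the two metrics can be written out in plain Python. This is a minimal sketch on hypothetical values (the `hours` list is made up for illustration), not the notebook's data. It also shows why an R² near 0 is a weak result: a model that always predicts the mean of the targets scores exactly 0.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical weekly hours, for illustration only
hours = [40, 50, 40, 30]
baseline = [sum(hours) / len(hours)] * len(hours)  # always predict the mean (40.0)

print(r2(hours, baseline))    # 0.0
print(rmse(hours, baseline))  # sqrt(50), about 7.07
```

Read this way, an RMSE of roughly 11 means the predictions are off by about 11 hours per week in the root-mean-square sense.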