<center>
<img src="images/ods_stickers.jpg" />
    
## Introducción al Machine Learning

Basado en material de  [Yury Kashnitsky](https://yorko.github.io). Traducido y editado al español por [Ana Georgina Flesia](https://www.linkedin.com/in/georginaflesia/). Este material esta sujeto a los términos y condiciones de la licencia  [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Se permite el uso irrestricto para todo propósito no comercial.

# <center> Ejercicio 1. Análisis exploratorio con Pandas

<img src="images/pandas.jpg"  width=50% />


**En esta tarea se tiene que usar Pandas para responder preguntas sobre el  dataset [Adult](https://archive.ics.uci.edu/ml/datasets/Adult). **

Variables y su tipo:
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- `salary`: >50K,<=50K

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max.columns', 100)
# para dibujar gráficos en jupyter notebook
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
# no nos gunstan los avisos
# se puden comentar las dos lineas si se quiere tener avisos
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('data/adult.data.csv')
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**1. Cuantas mujeres y hombres (variable *sex* ) estan representadas en este dataset?** 

In [3]:
data.sex.value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

**2. Cual es la edad promedio (variable *age* ) de las mujeres?**

In [4]:
femaleMask = data['sex'] == 'Female'
# Filtramos las mujeres y usamos la
# funcion describe para obtener los estadisticos
data[femaleMask]['age'].describe()

count    10771.000000
mean        36.858230
std         14.013697
min         17.000000
25%         25.000000
50%         35.000000
75%         46.000000
max         90.000000
Name: age, dtype: float64

**3. Cual es el porcentaje de ciudadanos alemanes (variable *native-country*)?**

In [5]:
# Contamos los valores y le pasamos a pandas el parametro de normalizacion
data['native-country'].value_counts(normalize=True)

United-States                 0.895857
Mexico                        0.019748
?                             0.017905
Philippines                   0.006081
Germany                       0.004207
Canada                        0.003716
Puerto-Rico                   0.003501
El-Salvador                   0.003255
India                         0.003071
Cuba                          0.002918
England                       0.002764
Jamaica                       0.002488
South                         0.002457
China                         0.002303
Italy                         0.002242
Dominican-Republic            0.002150
Vietnam                       0.002058
Guatemala                     0.001966
Japan                         0.001904
Poland                        0.001843
Columbia                      0.001812
Taiwan                        0.001566
Haiti                         0.001351
Iran                          0.001321
Portugal                      0.001136
Nicaragua                

**4. Cuales son la media y la desviacion estandard de la edad para aquellos que ganan mas de 50k por año (variable *salary*) ?**

In [6]:
# Filtramos los datos para saber quienes ganan mas y calculamos los estadisticos para la edad
highSalaryMask = data['salary'] == '>50K'
data[highSalaryMask]['age'].describe()

count    7841.000000
mean       44.249841
std        10.519028
min        19.000000
25%        36.000000
50%        44.000000
75%        51.000000
max        90.000000
Name: age, dtype: float64

**5. Cuales son la media y la desviacion estandard de la edad para aquellos que ganan menos de 50k por año (variable *salary*)?**

In [7]:
# Filtramos los datos por salarios mas bajos
lowSalaryMask = data['salary'] == '<=50K'
data[lowSalaryMask]['age'].describe()

count    24720.000000
mean        36.783738
std         14.020088
min         17.000000
25%         25.000000
50%         34.000000
75%         46.000000
max         90.000000
Name: age, dtype: float64

**6. Es cierto que la gente que gana mas que 50K tiene al menos educacion media completa? (variable *education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate*)**

No, no es cierto.

In [12]:
highSalaryData = data[highSalaryMask]
highSalaryData['education'].value_counts()

Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64

**7. Muestre los estadisticos para cada raza (*race* feature) y cada género (*sex* feature). Use *groupby()* y *describe()*. Encuentre la edad maxima de los hombres de la raza *Amer-Indian-Eskimo* .**

In [8]:
# Entre su código python aquí
sexRaceStatistics = data.groupby(by=['sex','race']).describe()

In [9]:
sexRaceStatistics

Unnamed: 0_level_0,Unnamed: 1_level_0,age,age,age,age,age,age,age,age,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,fnlwgt,education-num,education-num,education-num,education-num,education-num,education-num,education-num,education-num,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-gain,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,capital-loss,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
sex,race,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2
Female,Amer-Indian-Eskimo,119.0,37.117647,13.114991,17.0,27.0,36.0,46.0,80.0,119.0,112950.731092,93207.974077,12285.0,31387.0,87950.0,163027.5,445168.0,119.0,9.697479,2.33454,2.0,9.0,10.0,11.0,16.0,119.0,544.605042,2451.591587,0.0,0.0,0.0,0.0,15024.0,119.0,14.462185,157.763811,0.0,0.0,0.0,0.0,1721.0,119.0,36.579832,11.046509,4.0,35.0,40.0,40.0,84.0
Female,Asian-Pac-Islander,346.0,35.089595,12.300845,17.0,25.0,33.0,43.75,75.0,346.0,147452.075145,76401.627757,19914.0,86879.25,131986.0,175705.75,379046.0,346.0,10.390173,2.796647,1.0,9.0,10.0,13.0,15.0,346.0,778.436416,7675.228631,0.0,0.0,0.0,0.0,99999.0,346.0,50.852601,296.529225,0.0,0.0,0.0,0.0,2258.0,346.0,37.439306,12.479459,1.0,35.0,40.0,40.0,99.0
Female,Black,1555.0,37.854019,12.637197,17.0,28.0,37.0,46.0,90.0,1555.0,212971.387781,109971.263983,19752.0,142666.5,193553.0,253759.0,930948.0,1555.0,9.549839,2.207815,1.0,9.0,9.0,10.0,16.0,1555.0,516.593569,5312.749129,0.0,0.0,0.0,0.0,99999.0,1555.0,45.450804,299.099591,0.0,0.0,0.0,0.0,4356.0,1555.0,36.834084,9.41996,2.0,35.0,40.0,40.0,99.0
Female,Other,109.0,31.678899,11.631599,17.0,23.0,29.0,39.0,74.0,109.0,172519.642202,77766.666801,24562.0,119890.0,171199.0,219441.0,388741.0,109.0,8.899083,3.027482,2.0,7.0,9.0,10.0,14.0,109.0,254.669725,1317.32646,0.0,0.0,0.0,0.0,7688.0,109.0,36.284404,231.796929,0.0,0.0,0.0,0.0,1740.0,109.0,35.926606,10.300761,6.0,30.0,40.0,40.0,65.0
Female,White,8642.0,36.811618,14.329093,17.0,25.0,35.0,46.0,90.0,8642.0,183549.966906,101710.294874,19395.0,115914.75,175810.5,224836.5,1484705.0,8642.0,10.12798,2.368115,1.0,9.0,10.0,12.0,16.0,8642.0,573.610391,4763.131649,0.0,0.0,0.0,0.0,99999.0,8642.0,65.390535,352.330817,0.0,0.0,0.0,0.0,4356.0,8642.0,36.296691,12.190951,1.0,30.0,40.0,40.0,99.0
Male,Amer-Indian-Eskimo,192.0,37.208333,12.049563,17.0,28.0,35.0,45.0,82.0,192.0,125715.364583,85063.251595,13769.0,48197.75,113091.0,182656.0,356015.0,192.0,9.072917,2.268587,2.0,9.0,9.0,10.0,16.0,192.0,675.260417,2929.745443,0.0,0.0,0.0,0.0,27828.0,192.0,46.395833,286.562584,0.0,0.0,0.0,0.0,1980.0,192.0,42.197917,11.59628,3.0,40.0,40.0,45.0,84.0
Male,Asian-Pac-Islander,693.0,39.073593,12.883944,18.0,29.0,37.0,46.0,90.0,693.0,166175.865801,88552.9526,14878.0,98350.0,147719.0,200117.0,506329.0,693.0,11.24531,2.777463,1.0,9.0,11.0,13.0,16.0,693.0,1827.813853,10947.525528,0.0,0.0,0.0,0.0,99999.0,693.0,120.373737,472.917697,0.0,0.0,0.0,0.0,2457.0,693.0,41.468975,12.387563,1.0,40.0,40.0,45.0,99.0
Male,Black,1569.0,37.6826,12.882612,17.0,27.0,36.0,46.0,90.0,1569.0,242920.644997,134145.970948,21856.0,156410.0,221196.0,298601.0,1268339.0,1569.0,9.423199,2.382841,1.0,9.0,9.0,10.0,16.0,1569.0,702.45443,4962.113183,0.0,0.0,0.0,0.0,99999.0,1569.0,75.186106,370.976546,0.0,0.0,0.0,0.0,2824.0,1569.0,39.997451,10.909413,1.0,40.0,40.0,40.0,99.0
Male,Other,162.0,34.654321,11.355531,17.0,26.0,32.0,42.0,77.0,162.0,213679.104938,92187.362738,25610.0,150726.75,208516.5,253334.75,481175.0,162.0,8.802469,3.361897,1.0,8.0,9.0,10.0,16.0,162.0,1392.185185,11093.711595,0.0,0.0,0.0,0.0,99999.0,162.0,77.746914,370.98672,0.0,0.0,0.0,0.0,2179.0,162.0,41.851852,11.084779,5.0,40.0,40.0,40.0,98.0
Male,White,19174.0,39.652498,13.436029,17.0,29.0,38.0,49.0,90.0,19174.0,188987.386148,103714.59885,18827.0,117381.0,178662.5,236858.75,1455435.0,19174.0,10.138521,2.656464,1.0,9.0,10.0,13.0,16.0,19174.0,1368.674455,8442.830669,0.0,0.0,0.0,0.0,99999.0,19174.0,102.261343,434.156936,0.0,0.0,0.0,0.0,3770.0,19174.0,42.668822,12.194633,1.0,40.0,40.0,50.0,99.0


**8. De las siguientes categorías, cual es la que tiene la mayor proporcion de ricos (>50K): married or single men (variable *marital-status*)? Considere como casados  aquellos que tienen *marital-status* comenzando por *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), el resto son bachelors.**

In [29]:
marriedValues = ['Married-civ-spouse','Married-spouse-absent','Married-AF-spouse']
marriedMask = data['marital-status'].isin(marriedValues)
maleMask = data['sex'] == 'Male'
# Primero filtramos los datos segun el criterio de estado civil y ser hombre
marriedMaleData = data[marriedMask & maleMask]
singleMaleData = data[~marriedMask & maleMask]

In [30]:
marriedData['salary'].value_counts(normalize=True)

<=50K    0.56308
>50K     0.43692
Name: salary, dtype: float64

In [31]:
singleData['salary'].value_counts(normalize=True)

<=50K    0.935546
>50K     0.064454
Name: salary, dtype: float64

Claramente el grupo de hombres casados tiene la mayor proporcion de ricos (0.56%)

**9. Cual es el máximo de horas que una persona trabaja por semana? (variable *hours-per-week*)? Cuantas personas trabajan ese numero de horas y cual es el porcentaje entre esas personas que ademas ganan mucho (>50K) ?**

In [44]:
maxHourWorked = data['hours-per-week'].max()
max_workers_df = data[data['hours-per-week'] == maxHourWorked]
print("{} personas trabajan {} horas, que es el maximo por semana.".format(max_workers_df.shape[0],maxHourWorked))
max_workers_df.salary.value_counts(normalize=True)


85 personas trabajan 99 horas, que es el maximo por semana.


<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64

**10. Cuente el numero de horas de trabajo (*hours-per-week*) de aquellos que ganan poco, y de los que ganan mucho (*salary*) para cada pais (*native-country*). Cuales son esos conteos para Japon?**

In [46]:
working_hours = data.groupby(by=['native-country','salary'])['hours-per-week'].agg('mean').to_frame().T
working_hours

native-country,?,?,Cambodia,Cambodia,Canada,Canada,China,China,Columbia,Columbia,Cuba,Cuba,Dominican-Republic,Dominican-Republic,Ecuador,Ecuador,El-Salvador,El-Salvador,England,England,France,France,Germany,Germany,Greece,Greece,Guatemala,Guatemala,Haiti,Haiti,Holand-Netherlands,Honduras,Honduras,Hong,Hong,Hungary,Hungary,India,India,Iran,Iran,Ireland,Ireland,Italy,Italy,Jamaica,Jamaica,Japan,Japan,Laos,Laos,Mexico,Mexico,Nicaragua,Nicaragua,Outlying-US(Guam-USVI-etc),Peru,Peru,Philippines,Philippines,Poland,Poland,Portugal,Portugal,Puerto-Rico,Puerto-Rico,Scotland,Scotland,South,South,Taiwan,Taiwan,Thailand,Thailand,Trinadad&Tobago,Trinadad&Tobago,United-States,United-States,Vietnam,Vietnam,Yugoslavia,Yugoslavia
salary,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K,<=50K,>50K
hours-per-week,40.16476,45.547945,41.416667,40.0,37.914634,45.641026,37.381818,38.9,38.684211,50.0,37.985714,42.44,42.338235,47.0,38.041667,48.75,36.030928,45.0,40.483333,44.533333,41.058824,50.75,39.139785,44.977273,41.809524,50.625,39.360656,36.666667,36.325,42.75,40.0,34.333333,60.0,39.142857,45.0,31.3,50.0,38.233333,46.475,41.44,47.5,40.947368,48.0,39.625,45.4,38.239437,41.1,41.0,47.958333,40.375,40.0,40.003279,46.575758,36.09375,37.5,41.857143,35.068966,40.0,38.065693,43.032787,38.166667,39.0,41.939394,41.5,38.470588,39.416667,39.444444,46.666667,40.15625,51.4375,33.774194,46.8,42.866667,58.333333,37.058824,40.0,38.799127,45.505369,37.193548,39.2,41.6,49.5


In [52]:
working_hours['Japan']

salary,<=50K,>50K
hours-per-week,41.0,47.958333
