# Data Science 5K Capstone Proposal
In order to get your capstone approved, you must complete all of the following steps.

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data cannot have _anything_ to do with your work at Booz Allen Hamilton.
* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your capstone to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Your data is required to be free and open to anyone. As such, you should have a URL which anyone can use to download your data:

https://www.kaggle.com/gbonesso/enem2015/home

From this Kaggle page, I found the original source website, where I downloaded the most recent year of data (2017):

http://portal.inep.gov.br/microdados

## 3) Import your data
In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 

%matplotlib inline

In [2]:
!pwd

/c/Users/602770/teds/PROJECT-5-CAPSTONE


In [3]:
path = "../enem2017/DADOS/MICRODADOS_ENEM_2017.csv"

In [4]:
enem = pd.read_csv(path, encoding="latin1", sep=";", low_memory=False)

In [5]:
enem.drop(["NU_ANO","CO_MUNICIPIO_RESIDENCIA","CO_UF_RESIDENCIA", "CO_MUNICIPIO_PROVA", "CO_UF_PROVA"], axis=1, inplace=True)

In [6]:
nordeste = enem.loc[enem["SG_UF_RESIDENCIA"].isin(["AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"]), :]

In [7]:
nordeste.to_csv("nordeste17.csv", encoding="latin1", index=False)
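
Loading the full national file before filtering can use a lot of memory. As a minimal sketch, assuming the same separator, encoding, and `SG_UF_RESIDENCIA` column used above, the file could instead be read in chunks and filtered as it streams. The helper name `read_northeast`, the chunk size, and the tiny in-memory sample are illustrative assumptions and not part of the original notebook.

```python
import io

import pandas as pd

# Two-letter codes of the nine Northeast states, as in the filter above.
NORTHEAST_STATES = ["AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"]


def read_northeast(source, chunksize=100_000):
    """Read a semicolon-separated ENEM file in chunks, keeping only Northeast rows.

    `read_northeast` is a hypothetical helper; `chunksize` is an arbitrary default.
    """
    kept = []
    for chunk in pd.read_csv(source, sep=";", encoding="latin1", chunksize=chunksize):
        # Keep only the rows whose state of residence is in the Northeast.
        kept.append(chunk[chunk["SG_UF_RESIDENCIA"].isin(NORTHEAST_STATES)])
    return pd.concat(kept, ignore_index=True)


# Made-up sample standing in for the real file (values are illustrative only).
sample_bytes = (
    "NU_INSCRICAO;NO_MUNICIPIO_RESIDENCIA;SG_UF_RESIDENCIA\n"
    "1;Maceió;AL\n"
    "2;São Paulo;SP\n"
    "3;Fortaleza;CE\n"
    "4;Rio de Janeiro;RJ\n"
).encode("latin1")

demo = read_northeast(io.BytesIO(sample_bytes), chunksize=2)
print(demo)
```

On the real data, `read_northeast(path)` would be called with the file path defined earlier. Only one chunk plus the rows kept so far need to be in memory at a time.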

In [8]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [9]:
pd.options.display.float_format = '{:.8f}'.format

## 4) Show me the head of your data.

In [10]:
nordeste.head()

Unnamed: 0,NU_INSCRICAO,NO_MUNICIPIO_RESIDENCIA,SG_UF_RESIDENCIA,NU_IDADE,TP_SEXO,TP_ESTADO_CIVIL,TP_COR_RACA,TP_NACIONALIDADE,CO_MUNICIPIO_NASCIMENTO,NO_MUNICIPIO_NASCIMENTO,CO_UF_NASCIMENTO,SG_UF_NASCIMENTO,TP_ST_CONCLUSAO,TP_ANO_CONCLUIU,TP_ESCOLA,TP_ENSINO,IN_TREINEIRO,CO_ESCOLA,CO_MUNICIPIO_ESC,NO_MUNICIPIO_ESC,CO_UF_ESC,SG_UF_ESC,TP_DEPENDENCIA_ADM_ESC,TP_LOCALIZACAO_ESC,TP_SIT_FUNC_ESC,IN_BAIXA_VISAO,IN_CEGUEIRA,IN_SURDEZ,IN_DEFICIENCIA_AUDITIVA,IN_SURDO_CEGUEIRA,IN_DEFICIENCIA_FISICA,IN_DEFICIENCIA_MENTAL,IN_DEFICIT_ATENCAO,IN_DISLEXIA,IN_DISCALCULIA,IN_AUTISMO,IN_VISAO_MONOCULAR,IN_OUTRA_DEF,IN_GESTANTE,IN_LACTANTE,IN_IDOSO,IN_ESTUDA_CLASSE_HOSPITALAR,IN_SEM_RECURSO,IN_BRAILLE,IN_AMPLIADA_24,IN_AMPLIADA_18,IN_LEDOR,IN_ACESSO,IN_TRANSCRICAO,IN_LIBRAS,IN_LEITURA_LABIAL,IN_MESA_CADEIRA_RODAS,IN_MESA_CADEIRA_SEPARADA,IN_APOIO_PERNA,IN_GUIA_INTERPRETE,IN_COMPUTADOR,IN_CADEIRA_ESPECIAL,IN_CADEIRA_CANHOTO,IN_CADEIRA_ACOLCHOADA,IN_PROVA_DEITADO,IN_MOBILIARIO_OBESO,IN_LAMINA_OVERLAY,IN_PROTETOR_AURICULAR,IN_MEDIDOR_GLICOSE,IN_MAQUINA_BRAILE,IN_SOROBAN,IN_MARCA_PASSO,IN_SONDA,IN_MEDICAMENTOS,IN_SALA_INDIVIDUAL,IN_SALA_ESPECIAL,IN_SALA_ACOMPANHANTE,IN_MOBILIARIO_ESPECIFICO,IN_MATERIAL_ESPECIFICO,IN_NOME_SOCIAL,NO_MUNICIPIO_PROVA,SG_UF_PROVA,TP_PRESENCA_CN,TP_PRESENCA_CH,TP_PRESENCA_LC,TP_PRESENCA_MT,CO_PROVA_CN,CO_PROVA_CH,CO_PROVA_LC,CO_PROVA_MT,NU_NOTA_CN,NU_NOTA_CH,NU_NOTA_LC,NU_NOTA_MT,TX_RESPOSTAS_CN,TX_RESPOSTAS_CH,TX_RESPOSTAS_LC,TX_RESPOSTAS_MT,TP_LINGUA,TX_GABARITO_CN,TX_GABARITO_CH,TX_GABARITO_LC,TX_GABARITO_MT,TP_STATUS_REDACAO,NU_NOTA_COMP1,NU_NOTA_COMP2,NU_NOTA_COMP3,NU_NOTA_COMP4,NU_NOTA_COMP5,NU_NOTA_REDACAO,Q001,Q002,Q003,Q004,Q005,Q006,Q007,Q008,Q009,Q010,Q011,Q012,Q013,Q014,Q015,Q016,Q017,Q018,Q019,Q020,Q021,Q022,Q023,Q024,Q025,Q026,Q027
4,170001663646,Maceió,AL,40.0,M,0.0,3,1,2704302.0,Maceió,27.0,AL,1,11,1,,0,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Maceió,AL,1.0,1.0,1.0,1.0,392.0,395.0,399.0,404.0,482.1,569.2,570.8,584.6,EAAAAAEEEABCDEABDDBAEDDDBBDBCAEEBEEBCACBACBEE,EDCEAABBECBDBDADDDADCAEEEDEBBEEEACADAEAACECBA,EDBED99999BADEDEABCCBDCEAEDEEBEBBBDEADBEEAADAE...,CACBDEACECECEECDDBBBDBABCDBCCEEDBBCCDDCCDACED,0,BCBBEBAAEDDCBDADEEBADECDCBCDAAEABCEEAAECDCCDA,CDDECADBEABDBEDAECAEBDAEBAEDBDBBAECDAEBCCCCDE,DDCDEEDBEEBDAEDAABCECDAEBADEDEDBBBDEABBCCABAAE...,CCECEECDADBBDBBDBAEBDDABECBDCCDEDBBACAEADAEAC,1.0,140.0,120.0,120.0,120.0,80.0,580.0,A,B,B,F,3.0,C,A,B,B,A,B,B,A,A,A,A,A,A,B,B,A,C,A,C,B,A,A
9,170001669940,Jaboatão dos Guararapes,PE,23.0,M,0.0,3,1,2607901.0,Jaboatão dos Guararapes,26.0,PE,1,3,1,,0,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Jaboatão dos Guararapes,PE,1.0,1.0,1.0,1.0,391.0,396.0,400.0,403.0,533.8,427.6,320.7,479.0,BDEAEBDADAEEBCBDABADCECCACEADBDBBCACEAAACAABC,CABDABBCEDBACCBACADBACBACBDACABCEADBECAEBADEB,CBAED99999AACDDACDECACACDADBDBDBDACCEBADBADCDA...,BCBDABCDEBEBADECDCAABCEEBDEBDBDCEBACADCDBEABA,0,DEEBDABCBBEDDCBABCADECEBAADAAECDCBCCDACDEEAAE,CDEAEECAEBDBDBBAECDAEBCCCDAEBEABDBEDAADBCDDEC,EDDCDBEEDECCEBDAEDAEDAABEDBBADEDEDDABAABBDCBEA...,ADBCCECBBDBAEBBDDDABDCCDEDECBEACDAEAABBACEECD,1.0,80.0,100.0,80.0,60.0,0.0,320.0,A,E,A,A,3.0,B,B,B,C,A,A,B,A,B,A,B,A,A,B,B,A,D,B,B,B,A,A
11,170001668485,Salvador,BA,26.0,F,0.0,0,1,2927408.0,Salvador,29.0,BA,1,7,1,,0,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Salvador,BA,1.0,1.0,1.0,1.0,392.0,396.0,400.0,404.0,613.2,616.7,623.1,656.5,DDEBBEACEADDDCADEBDACDBECBEDABBDBEAEBAEDECCDE,DACAEDCDDCDBEEBACCEBBBDCDBBEDEDBEDBDAADECDBEC,EDDCD99999BCEADCEDAEDAADEDBEADADEDDBCAADEDCBEA...,AECCCEBCDBABDDADBADBEEABBECDCEDEEBBCACEAEBCBD,0,BCBBEBAAEDDCBDADEEBADECDCBCDAAEABCEEAAECDCCDA,CDEAEECAEBDBDBBAECDAEBCCCDAEBEABDBEDAADBCDDEC,EDDCDBEEDECCEBDAEDAEDAABEDBBADEDEDDABAABBDCBEA...,CCECEECDADBBDBBDBAEBDDABECBDCCDEDBBACAEADAEAC,1.0,140.0,120.0,120.0,120.0,60.0,560.0,G,F,D,B,6.0,J,A,D,D,B,B,B,B,B,B,B,A,A,D,A,B,E,B,E,B,A,D
13,170001665202,Itarema,CE,24.0,M,0.0,1,1,2307650.0,Maracanaú,23.0,CE,1,7,1,,0,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Itarema,CE,1.0,1.0,1.0,1.0,391.0,397.0,402.0,403.0,501.8,566.6,544.9,645.1,BACEADBACDEDDABECECDBABCAAEDBDCBBDCCACEDAABDD,ACDEBBDCECCDBACBEAAEBCDBBDCEDECAEDCEBECAABDCA,99999CDBDCBEEAEADEADCDBACABEEEADAADBCACBEDEAAB...,EBACEACCDABEBBBDDDABBCEDCDDDDCCBCBEEDBEABADBD,1,DEEBDABCBBEDDCBABCADECEBAADAAECDCBCCDACDEEAAE,ECAEBCDDECADBEABDDBDBDAEBCCCCDEAEDAEBBEDABAEC,DDCDEEDBEECEBADEDEABEDBBBDBCCABAAAABCBDAEDEDDB...,ADBCCECBBDBAEBBDDDABDCCDEDECBEACDAEAABBACEECD,1.0,160.0,120.0,120.0,120.0,100.0,620.0,B,E,A,F,5.0,E,A,C,D,A,B,B,A,A,A,A,A,A,C,A,A,E,B,B,B,A,A
14,170001665203,Fortaleza,CE,24.0,M,0.0,2,1,2304400.0,Fortaleza,23.0,CE,1,3,1,,0,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fortaleza,CE,1.0,1.0,1.0,1.0,392.0,395.0,399.0,404.0,462.5,523.6,571.5,517.6,ACCBAAECEECBEAABDCAEDDDCDABBBCAADBCACACABCBCD,ECDDCAAEEABABBEAABAADCBEBABEBCDCCBAEABBCADBEA,DDBEE99999BEDDCAABCDCDAECADEBEEBDCCEAEBDEABCAC...,AACACADDBECBECDEEDBBACACBBBCADDCCACBDEBBABCBD,0,BCBBEBAAEDDCBDADEEBADECDCBCDAAEABCEEAAECDCCDA,CDDECADBEABDBEDAECAEBDAEBAEDBDBBAECDAEBCCCCDE,DDCDEEDBEEBDAEDAABCECDAEBADEDEDBBBDEABBCCABAAE...,CCECEECDADBBDBBDBAEBDDABECBDCCDEDBBACAEADAEAC,1.0,160.0,120.0,120.0,140.0,80.0,620.0,B,B,C,B,5.0,C,A,B,B,A,A,B,A,A,A,A,A,A,B,A,A,A,A,B,B,A,A


## 5) Show me the shape of your data

In [11]:
nordeste.shape

(2223044, 132)

## 6) Show me the proportion of missing observations for each column of your data

In [12]:
nordeste.isnull().mean()

NU_INSCRICAO                  0.00000000
NO_MUNICIPIO_RESIDENCIA       0.00000000
SG_UF_RESIDENCIA              0.00000000
NU_IDADE                      0.00001170
TP_SEXO                       0.00000000
TP_ESTADO_CIVIL               0.03730201
TP_COR_RACA                   0.00000000
TP_NACIONALIDADE              0.00000000
CO_MUNICIPIO_NASCIMENTO       0.03655438
NO_MUNICIPIO_NASCIMENTO       0.03655438
CO_UF_NASCIMENTO              0.03655438
SG_UF_NASCIMENTO              0.03655438
TP_ST_CONCLUSAO               0.00000000
TP_ANO_CONCLUIU               0.00000000
TP_ESCOLA                     0.00000000
TP_ENSINO                     0.77158707
IN_TREINEIRO                  0.00000000
CO_ESCOLA                     0.77158527
CO_MUNICIPIO_ESC              0.77158842
NO_MUNICIPIO_ESC              0.77158842
CO_UF_ESC                     0.77158842
SG_UF_ESC                     0.77158842
TP_DEPENDENCIA_ADM_ESC        0.77158842
TP_LOCALIZACAO_ESC            0.77158842
TP_SIT_FUNC_ESC 
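
With 132 columns, the raw listing is easier to act on when sorted. The sketch below orders the missing-value proportions from largest to smallest and flags columns above a cutoff. The helper name `missing_report`, the 0.5 cutoff, and the small demo frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def missing_report(df, threshold=0.5):
    """Return per-column missing shares (largest first) and the columns above `threshold`.

    `missing_report` is a hypothetical helper; the 0.5 default is an arbitrary cutoff.
    """
    share = df.isnull().mean().sort_values(ascending=False)
    mostly_missing = share[share > threshold].index.tolist()
    return share, mostly_missing


# Made-up frame reusing a few column names from the data (values are illustrative).
demo = pd.DataFrame(
    {
        "NU_IDADE": [40.0, 23.0, np.nan, 24.0],
        "TP_ENSINO": [np.nan, np.nan, np.nan, 1.0],
        "TP_SEXO": ["M", "M", "F", "M"],
    }
)

share, mostly_missing = missing_report(demo)
print(share)
print(mostly_missing)
```

Applied to the real data as `missing_report(nordeste)`, this would surface the school-related columns, which the output above shows are missing for roughly 77% of rows.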

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

_Brazil's traditional college entrance exam, the "Vestibular", has been criticized for giving an unfair advantage to wealthy private-school students. In 1998, a new exam, "ENEM", was created, and in 2009 a revised format known as the "New ENEM" was introduced with the goal of democratizing access to higher education. Our task is to help the federal government analyze the 2017 exam results to find trends in the demographics of students who took the exam (age, race, sex, marital status, disability status, income, education type, state, and region, among other factors). Using this data, we will develop a predictive model to measure how well demographic factors predict exam scores. This information may be useful in guiding future government actions to support disadvantaged students more precisely._

## 8) What is your _y_-variable?
For Part C of your capstone, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

My y-variable is a student's exam score, which I will predict from various attributes (x-variables).
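
The data contains five score columns (`NU_NOTA_CN`, `NU_NOTA_CH`, `NU_NOTA_LC`, `NU_NOTA_MT`, `NU_NOTA_REDACAO`), and the proposal does not yet say how they combine into one target. One possible choice, shown here purely as an assumption, is the mean of the five scores for students who have all of them. The helper `build_target` and the output name `NOTA_MEDIA` are hypothetical.

```python
import numpy as np
import pandas as pd

# Score columns that appear in the head of the data above.
SCORE_COLUMNS = [
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_REDACAO",
]


def build_target(df):
    """Return the mean of the five scores for rows where all five are present.

    Averaging the scores is an assumed definition of y, not one stated in the proposal.
    """
    complete = df.dropna(subset=SCORE_COLUMNS)
    return complete[SCORE_COLUMNS].mean(axis=1).rename("NOTA_MEDIA")


# First row uses the scores from the first student shown in the head;
# the second row is a made-up student missing one score.
demo = pd.DataFrame(
    {
        "NU_NOTA_CN": [482.1, 533.8],
        "NU_NOTA_CH": [569.2, np.nan],
        "NU_NOTA_LC": [570.8, 320.7],
        "NU_NOTA_MT": [584.6, 479.0],
        "NU_NOTA_REDACAO": [580.0, 320.0],
    }
)

y = build_target(demo)
print(y)
```

Other definitions are equally plausible, such as modeling a single subject score or the essay score alone.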