## Information

Title: Costa Rican Household Poverty Level Prediction

Website: https://www.kaggle.com/c/costa-rican-household-poverty-prediction

Purpose:
- *About*: Many social programs have a hard time making sure the right people are given enough aid. It’s especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify.
- *In Latin America, one popular method uses an algorithm to verify income qualification. It’s called the Proxy Means Test (or PMT). With PMT, agencies use a model that considers a family’s observable household attributes like the material of their walls and ceiling, or the assets found in the home to classify them and predict their level of need.*

Scoring Metric: Accuracy
- Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, for binary classification, accuracy has the following definition:
- Formula: $(TP + TN)/(TP + FP + TN + FN)$
    - TP: True Positives
    - FP: False Positives
    - TN: True Negatives
    - FN: False Negatives
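
As a minimal sketch, the informal definition above (the fraction of predictions that are right) can be computed directly, which also covers the four `Target` classes rather than only the binary TP/TN case. The function name `accuracy` and the sample labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels on the 1-4 poverty scale (not real competition data)
y_true = [4, 4, 2, 1, 3, 4]
y_pred = [4, 3, 2, 1, 3, 4]

# 5 of the 6 predictions match the true label
print(accuracy(y_true, y_pred))
```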

## About Data

Data is provided by Kaggle.com: https://www.kaggle.com/c/costa-rican-household-poverty-prediction/data


## Lessons Learned

1. It's important to understand `dtypes`: a column stored with the wrong type can cause some models to fail or to run less efficiently.
2. Creating a color dictionary for the target values keeps each value's color consistent across all plots.
3. `groupby` is one of the most useful features in pandas. It helps structure the data for analysis and visualization.
4. Missing values can be tricky. A value missing in one feature may actually be recorded in another feature.
5. Function: `get_categorical_percent()`
    - Useful function to understand the distribution of the target variable.
    - Location of code: "/Users/alexguanga/All_Projects/ds-portfolio/DataScience_Framework/DataScience_Code/categorical_percent.py"
6. `feature importance` is a useful way to understand the influence of variables on the models.
7. `cumulative feature importance` can also shed light on how many features are needed.
8. It's good practice to plot the score results; this helps us see whether our models are improving.
9. `Recursive Feature Elimination` can help us select the important features.
10. This should be explored further, but the `macro_f1` score was used to improve the model.
    - F1 = 2 * (precision * recall) / (precision + recall)
    - Precision = tp/(tp+fp)
        - The ratio of true positives to the total of true positives and false positives.
        - How accurate the positive predictions are. Suppose we train a model and treat a sunny day as the positive class. True positives are the times our model correctly predicted a sunny day. False positives are the times our model predicted sunny but the day was actually rainy. In both cases we predicted sunny; precision is the share of those predictions that were correct.
    - Recall = tp/(tp+fn)
        - The ratio of true positives to the total number of true positives and false negatives.
        - Recall looks at the actual ground truths. Again, treat a sunny day as the positive class. True positives are the times our model correctly predicted a sunny day. False negatives are the days we predicted rainy that were actually sunny. Thus, recall is the ratio of correct positive predictions to the number of days that were actually sunny.
11. The library `hyperopt` can help us narrow down the search space for hyperparameter tuning. Hence, computation can speed up, since we do not waste time testing our model in the wrong "search space".
12. Another **good practice** is to save the trained models to `pickle` files or a CSV file.
13. The `get_corrs_matrix` offers a great method to find the inter-correlation among features!
    - PATH = "/Users/alexguanga/All_Projects/ds-portfolio/DataScience_Framework/DataScience_Code/get_corrs_matrix.py"
14. `argmax` is a useful NumPy function (`np.argmax`). For example, if we have boolean variables that are mutually exclusive and **where order matters**, we can use `argmax` to return the index of the column that is set, which collapses them into a single ordinal value.
15. If we encounter `SettingWithCopyWarning`, we can use:
    - `pd.options.mode.chained_assignment = None  # default='warn'` to disable the warning. Note that this only silences the warning; it does not change how the assignment behaves.
16. When using `lambda` functions, one neat trick is to assign a name to the lambda function. 
    ```python
    # Creating the range functions
    range_ = lambda x: x.max() - x.min()
    range_.__name__ = 'range_'

    # Group and aggregate
    ind_agg = ind.drop(['Target'], axis=1).groupby('idhogar').agg(['count', 'std', range_])
    ```
17. The scikit-learn `Pipeline` class is a great way to chain data-manipulation steps with less code.
    - For example, we can use an `imputer` and a `scaler` in the pipeline.
        - `imputer`: How to handle missing values.
        - `scaler`: How to standardize the data.
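
The precision, recall, and F1 formulas from item 10 can be sketched in plain Python to show what "macro" averaging means: F1 is computed separately for each class and then averaged with equal weight. This is an illustrative sketch with made-up labels, not the code used in the project; in practice scikit-learn's `f1_score(y_true, y_pred, average="macro")` does the same job.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        # Treat `label` as the positive class and everything else as negative
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical labels on the 1-4 poverty scale (not real competition data)
y_true = [1, 2, 2, 3, 4, 4, 4, 4]
y_pred = [1, 2, 3, 3, 4, 4, 4, 2]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class counts equally, a rare class such as extreme poverty influences the score as much as the most common class, which plain accuracy does not capture.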

## Findings

- Overall performance was in the 32nd percentile.
- Variables related to `age` had great importance.
- Education also had a high predictive value.

## Variables

Important features:
- Id - a unique identifier for each row.
- Target - the target is an ordinal variable indicating groups of income levels. 
    - 1 = extreme poverty 
    - 2 = moderate poverty 
    - 3 = vulnerable households 
    - 4 = non vulnerable households
- idhogar - this is a unique identifier for each household. This can be used to create household-wide features, etc. All rows in a given household will have a matching value for this identifier.
- parentesco1 - indicates if this person is the head of the household.
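
As a rough sketch of how `idhogar` and `parentesco1` can be used together, the snippet below builds household-wide features with pandas `groupby` and attaches one attribute of the household head. The rows and the output column names (`age_max`, `escolari_mean`, `members`, `head_age`) are invented for illustration; only `idhogar`, `parentesco1`, `age`, and `escolari` come from the data description.

```python
import pandas as pd

# Hypothetical individual-level rows; values are made up for illustration
ind = pd.DataFrame({
    "idhogar": ["a", "a", "a", "b", "b"],
    "parentesco1": [1, 0, 0, 1, 0],
    "age": [40, 38, 10, 65, 30],
    "escolari": [6, 9, 3, 0, 11],
})

# One row per household: aggregate the individual-level columns
household = ind.groupby("idhogar").agg(
    age_max=("age", "max"),
    escolari_mean=("escolari", "mean"),
    members=("age", "size"),
)

# Attach the age of the household head (rows where parentesco1 == 1)
heads = (
    ind.loc[ind["parentesco1"] == 1, ["idhogar", "age"]]
    .rename(columns={"age": "head_age"})
    .set_index("idhogar")
)
household = household.join(heads)
print(household)
```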

Binary:
- hacdor, =1 Overcrowding by bedrooms
- hacapo, =1 Overcrowding by rooms
- v14a, =1 has bathroom in the household
- refrig, =1 if the household has refrigerator
- paredblolad, =1 if predominant material on the outside wall is block or brick
- paredzocalo, =1 if predominant material on the outside wall is socket (wood, zinc or asbestos)
- paredpreb, =1 if predominant material on the outside wall is prefabricated or cement
- pareddes, =1 if predominant material on the outside wall is waste material
- paredmad, =1 if predominant material on the outside wall is wood
- paredzinc, =1 if predominant material on the outside wall is zinc
- paredfibras, =1 if predominant material on the outside wall is natural fibers
- paredother, =1 if predominant material on the outside wall is other
- pisomoscer, =1 if predominant material on the floor is mosaic, ceramic, terrazzo
- pisocemento, =1 if predominant material on the floor is cement
- pisoother, =1 if predominant material on the floor is other
- pisonatur, =1 if predominant material on the floor is natural material
- pisonotiene, =1 if no floor at the household
- pisomadera, =1 if predominant material on the floor is wood
- techozinc, =1 if predominant material on the roof is metal foil or zinc
- techoentrepiso, =1 if predominant material on the roof is fiber cement, mezzanine
- techocane, =1 if predominant material on the roof is natural fibers
- techootro, =1 if predominant material on the roof is other
- cielorazo, =1 if the house has ceiling
- abastaguadentro, =1 if water provision inside the dwelling
- abastaguafuera, =1 if water provision outside the dwelling
- abastaguano, =1 if no water provision
- public, =1 electricity from CNFL, ICE, ESPH/JASEC
- planpri, =1 electricity from private plant
- noelec, =1 no electricity in the dwelling
- coopele, =1 electricity from cooperative
- sanitario1, =1 no toilet in the dwelling
- sanitario2, =1 toilet connected to sewer or cesspool
- sanitario3, =1 toilet connected to septic tank
- sanitario5, =1 toilet connected to black hole or latrine
- sanitario6, =1 toilet connected to other system
- energcocinar1, =1 no main source of energy used for cooking (no kitchen)
- energcocinar2, =1 main source of energy used for cooking electricity
- energcocinar3, =1 main source of energy used for cooking gas
- energcocinar4, =1 main source of energy used for cooking wood charcoal
- elimbasu1, =1 if rubbish disposal mainly by tanker truck
- elimbasu2, =1 if rubbish disposal mainly by botan hollow or buried
- elimbasu3, =1 if rubbish disposal mainly by burning
- elimbasu4, =1 if rubbish disposal mainly by throwing in an unoccupied space
- elimbasu5, =1 if rubbish disposal mainly by throwing in river, creek or sea
- elimbasu6, =1 if rubbish disposal mainly other
- epared1, =1 if walls are bad
- epared2, =1 if walls are regular
- epared3, =1 if walls are good
- etecho1, =1 if roof is bad
- etecho2, =1 if roof is regular
- etecho3, =1 if roof is good
- eviv1, =1 if floor is bad
- eviv2, =1 if floor is regular
- eviv3, =1 if floor is good
- dis, =1 if disabled person
- male, =1 if male
- female, =1 if female
- estadocivil1, =1 if less than 10 years old
- estadocivil2, =1 if free or coupled union
- estadocivil3, =1 if married
- estadocivil4, =1 if divorced
- estadocivil5, =1 if separated
- estadocivil6, =1 if widow/er
- estadocivil7, =1 if single
- parentesco1, =1 if household head
- parentesco2, =1 if spouse/partner
- parentesco3, =1 if son/daughter
- parentesco4, =1 if stepson/daughter
- parentesco5, =1 if son/daughter in law
- parentesco6, =1 if grandson/daughter
- parentesco7, =1 if mother/father
- parentesco8, =1 if father/mother in law
- parentesco9, =1 if brother/sister
- parentesco10, =1 if brother/sister in law
- parentesco11, =1 if other family member
- parentesco12, =1 if other non family member
- instlevel1, =1 no level of education
- instlevel2, =1 incomplete primary
- instlevel3, =1 complete primary
- instlevel4, =1 incomplete academic secondary level
- instlevel5, =1 complete academic secondary level
- instlevel6, =1 incomplete technical secondary level
- instlevel7, =1 complete technical secondary level
- instlevel8, =1 undergraduate and higher education
- instlevel9, =1 postgraduate higher education
- tipovivi1, =1 own and fully paid house
- tipovivi2, =1 own, paying in installments
- tipovivi3, =1 rented
- tipovivi4, =1 precarious
- tipovivi5, =1 other (assigned, borrowed)
- computer, =1 if the household has notebook or desktop computer
- television, =1 if the household has TV
- mobilephone, =1 if mobile phone
- lugar1, =1 region Central
- lugar2, =1 region Chorotega
- lugar3, =1 region Pacífico central
- lugar4, =1 region Brunca
- lugar5, =1 region Huetar Atlántica
- lugar6, =1 region Huetar Norte
- area1, =1 urban zone
- area2, =1 rural zone
- v18q, owns a tablet
- edjefe, years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
- edjefa, years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
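
Several groups in the list above, such as `epared1` to `epared3` (walls bad, regular, good), are one-hot flags over an ordered scale. A minimal sketch of collapsing such a group into a single ordinal value with `np.argmax` follows; the rows are made up, and the 0/1/2 coding is just the column position, not a coding defined by the dataset.

```python
import numpy as np

# Hypothetical rows; columns are epared1 (bad), epared2 (regular), epared3 (good)
epared = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# argmax along each row returns the position of the single 1:
# 0 = bad, 1 = regular, 2 = good
walls = np.argmax(epared, axis=1)
print(walls.tolist())  # [0, 2, 1]
```

This relies on exactly one flag being set per row. A row with no flag set would also map to 0, so it is worth checking that each row sums to 1 before collapsing.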


Discrete:
- rooms, number of all rooms in the house
- v18q1, number of tablets household owns
- r4h1, Males younger than 12 years of age
- r4h2, Males 12 years of age and older
- r4h3, Total males in the household
- r4m1, Females younger than 12 years of age
- r4m2, Females 12 years of age and older
- r4m3, Total females in the household
- r4t1, persons younger than 12 years of age
- r4t2, persons 12 years of age and older
- r4t3, Total persons in the household
- tamhog, size of the household
- tamviv, number of persons living in the household
- escolari, years of schooling
- rez_esc, Years behind in school
- hhsize, household size
- idhogar, Household level identifier
- hogar_nin, Number of children 0 to 19 in household
- hogar_adul, Number of adults in household
- hogar_mayor, # of individuals 65+ in the household
- hogar_total, # of total individuals in the household
- bedrooms, number of bedrooms
- overcrowding, # persons per room
- age, Age in years
- SQBescolari, escolari squared
- SQBage, age squared
- SQBhogar_total, hogar_total squared
- SQBedjefe, edjefe squared
- SQBhogar_nin, hogar_nin squared
- SQBovercrowding, overcrowding squared
- SQBdependency, dependency squared
- SQBmeaned, square of the mean years of education of adults (>=18) in the household
- agesq, Age squared

Continuous:
- v2a1, Monthly rent payment
- dependency, Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
- meaneduc, average years of education for adults (18+)
- qmobilephone, # of mobile phones
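
The `dependency` definition above can be sketched as a small function over the ages of a household's members. The function name and the choice to return `None` when there are no members aged 19 to 64 are assumptions made for this example; the description here does not say how the dataset encodes that case.

```python
def dependency_rate(ages):
    """Members younger than 19 or older than 64, divided by members aged 19-64."""
    dependents = sum(1 for age in ages if age < 19 or age > 64)
    working_age = sum(1 for age in ages if 19 <= age <= 64)
    if working_age == 0:
        # Undefined ratio; returning None is an assumption of this sketch
        return None
    return dependents / working_age


# Hypothetical household: two working-age adults, one child, one elder
print(dependency_rate([40, 38, 10, 70]))  # 1.0
```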