<a href="https://colab.research.google.com/github/abroniewski/Child-Wasting-Prediction/blob/Sprint2/notebooks/variable_exploration_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [471]:
#!pip install pandas-profiling==3.3.0
#!pip install openpyxl
#!pip install ipywidgets

In [472]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

Data used by the ZHL
- district (encoded)
- previous prevalence (6-month lag)
- covid
- ndvi
- ipc
- cropdiv
- population
- month
- next prevalence (6 months into the future) (target variable)


----
Districts with missing values in any of these columns are dropped, along with a few districts that have missing observations (rows). This leaves 33 districts with 7 observations each for a total of 231 observations.
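The lagged and future prevalence columns listed above can be sketched as follows. This is a minimal illustration, assuming one row per district per 6-month survey round; the toy frame and the column names `prev_prevalence` and `next_prevalence` are made up for the example and are not taken from the baseline code.

```python
import pandas as pd

# Toy data: one hypothetical district with 9 semi-annual survey rounds
toy = pd.DataFrame({
    "district": ["A"] * 9,
    "date": pd.date_range("2017-07-01", periods=9, freq="6MS"),
    "GAM Prevalence": [0.10, 0.12, 0.11, 0.15, 0.14, 0.13, 0.16, 0.18, 0.17],
})

toy = toy.sort_values(["district", "date"])
grouped = toy.groupby("district")["GAM Prevalence"]
toy["prev_prevalence"] = grouped.shift(1)    # value one round (6 months) earlier
toy["next_prevalence"] = grouped.shift(-1)   # target: value one round (6 months) later

# Rows without a lag or without a target cannot be used for training
model_rows = toy.dropna(subset=["prev_prevalence", "next_prevalence"])
print(len(model_rows))  # 7
```

With 9 rounds per district, the first round has no lag and the last has no target, which leaves 7 usable observations per district.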


## Read data

In [473]:
risk_factors_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/FSNAU_riskfactors.csv")
admissions_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/admissions.csv")
area_data_df = pd.read_excel("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/area_data.xlsx")
conflict_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/conflict.csv")
covid_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/covid.csv")
ipc_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/ipc.csv")
ipc2_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/ipc2.csv")
locations_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/locations.csv")
prevalence_estimates_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/prevalence_estimates.csv")
productions_df = pd.read_csv("https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/ZHL/production.csv")

In [474]:
# Fill missing values per district: linear interpolation, then back/forward fill, then 0.
# df format: first column is 'district', remaining columns hold numeric values.
# Rows should already be sorted by date, because interpolation follows row order.
def fill_missing_vals(df):
    filling_cols = list(df.columns)[1:]

    # convert every column except 'district' to numeric; unparseable values become NaN
    df[filling_cols] = df[filling_cols].apply(pd.to_numeric, errors='coerce')

    # per district: interpolate, then bfill and ffill the edges;
    # a column with no data at all for that district becomes 0
    for name, group in df.groupby('district'):
        filled = group[filling_cols].interpolate().bfill().ffill().fillna(0)
        df.loc[df['district'] == name, filling_cols] = filled

    return df

## Risk Factors

**date**: date of collection \
**district**: name of district \
**rainfall**: rainfall in district \
**ndvi_score**: Normalized Difference Vegetation Index ranging from -1 to 1. [Link to understand better](https://gisgeography.com/ndvi-normalized-difference-vegetation-index/) \
**Price of water, Sorghum, Maize, Red Rice**: prices of water and these crops \
**New Admissions (GAM)**: new admissions of Global Acute Malnutrition \
**Measles cases**: new cases of measles \
**AWD/cholera cases**: new Acute Watery Diarrhea (AWD)/cholera cases \
**AWD/cholera deaths**: new Acute Watery Diarrhea/cholera deaths \
**Malaria Cases**: new malaria cases \
**Insecurity - Incidents**: cases of violent and threatening incidents affecting aid operations, civilians, education, healthcare, refugees and IDPs \
**Insecurity - Fatalities**: deaths as a result of insecurity incidents \
**Displacement (arrivals)**: number of people arriving in the district \
**Displacement (departures)**: number of people leaving the district \
**Total alarms**: total number of alarms (see note)

Note: an alarm is a measurement related to climate, market, nutrition or health that exceeds an established threshold.

In [475]:
risk_factors_df.head()

Unnamed: 0,date,district,rainfall,ndvi_score,Price of water,Sorghum prices,Maize prices,Red Rice prices,New Admissions (GAM),Measles Cases,AWD/cholera cases,AWD/Cholera deaths,Malaria Cases,Insecurity - Incidents,Insecurity - Fatalities,Displacement (arrivals),Displacement (departures),Total alarms
0,2021-12-01,Mogadishu,0.27,0.22,,9.95,13.95,13.5,10.271,160.0,937.0,,7.0,51.0,42.0,56300 - 56400,9900 - 10000,12
1,2021-12-01,Afgooye,2.33,0.35,,13.5,13.5,16.0,1.649,,22.0,,,47.0,26.0,400 - 500,3000 - 3100,7
2,2021-12-01,Bossaso,2.79,0.18,15.0,48.0,,28.0,1.257,,,,2.0,4.0,20.0,13700 - 13800,16200 - 16300,9
3,2021-12-01,Balcad,1.06,0.38,,15.0,13.0,17.0,442.0,14.0,,,,5.0,18.0,0 - 100,200 - 300,9
4,2021-12-01,Baydhaba,7.46,0.32,30.0,14.1,16.5,17.0,5.411,162.0,165.0,,44.0,13.0,14.0,4200 - 4300,6500 - 6500,14


In [476]:
# missing values
risk_factors_df_missing = risk_factors_df.isnull().sum()*100/len(risk_factors_df)
risk_factors_df_missing

date                          0.000000
district                      0.000000
rainfall                      0.000000
ndvi_score                    1.058559
Price of water               46.171171
Sorghum prices               44.729730
Maize prices                 45.720721
Red Rice prices              25.563063
New Admissions (GAM)         18.941441
Measles Cases                63.738739
AWD/cholera cases            78.626126
AWD/Cholera deaths           98.040541
Malaria Cases                61.216216
Insecurity - Incidents       47.409910
Insecurity - Fatalities      47.409910
Displacement (arrivals)      33.423423
Displacement (departures)    11.576577
Total alarms                  0.000000
dtype: float64

--------

**Features comment -**

1. rainfall: no missing values
2. ndvi_score: used in the baseline, which drops districts with missing values; here only about 1.1% is missing, which can be imputed
3. crop prices: kept as separate columns
4. Total alarms: to be used
5. displacement (arrivals - departures) gives the net population change; missing values can be imputed
6. disease columns: dropped because too many values are missing


##### Proposed approach: fill missing values by interpolation (linear, polynomial, etc.) within each district, then do feature engineering
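Per-district interpolation can be sketched on a toy frame as below. The data and the helper name `fill_group` are illustrative assumptions; the fill order (interpolate, then back/forward fill, then 0) mirrors the `fill_missing_vals` function defined earlier.

```python
import numpy as np
import pandas as pd

# Toy data: district A has gaps, district B has no observations at all
toy = pd.DataFrame({
    "district": ["A", "A", "A", "A", "B", "B", "B"],
    "price": [np.nan, 10.0, np.nan, 14.0, np.nan, np.nan, np.nan],
})

def fill_group(s):
    # linear interpolation inside the series, then fill the edges,
    # then 0 if the district has no known value at all
    return s.interpolate().bfill().ffill().fillna(0)

toy["price_filled"] = toy.groupby("district")["price"].transform(fill_group)
print(toy["price_filled"].tolist())  # [10.0, 10.0, 12.0, 14.0, 0.0, 0.0, 0.0]
```

Note that the final `fillna(0)` produces a price of 0 for districts with no observations, which is a placeholder rather than a real price; a model may need a separate indicator column to tell the two apart.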

In [477]:
risk_factors_df.columns

Index(['date', 'district', 'rainfall', 'ndvi_score', 'Price of water',
       'Sorghum prices', 'Maize prices', 'Red Rice prices',
       'New Admissions (GAM)', 'Measles Cases', 'AWD/cholera cases',
       'AWD/Cholera deaths', 'Malaria Cases', 'Insecurity - Incidents',
       'Insecurity - Fatalities', 'Displacement (arrivals)',
       'Displacement (departures)', 'Total alarms'],
      dtype='object')

In [478]:
risk_factors_df_updated = risk_factors_df[['date', 'district', 'rainfall', 'ndvi_score',
       'Sorghum prices', 'Maize prices', 'Red Rice prices', 'Displacement (arrivals)',
       'Displacement (departures)', 'Total alarms']]
risk_factors_df_updated.head()

Unnamed: 0,date,district,rainfall,ndvi_score,Sorghum prices,Maize prices,Red Rice prices,Displacement (arrivals),Displacement (departures),Total alarms
0,2021-12-01,Mogadishu,0.27,0.22,9.95,13.95,13.5,56300 - 56400,9900 - 10000,12
1,2021-12-01,Afgooye,2.33,0.35,13.5,13.5,16.0,400 - 500,3000 - 3100,7
2,2021-12-01,Bossaso,2.79,0.18,48.0,,28.0,13700 - 13800,16200 - 16300,9
3,2021-12-01,Balcad,1.06,0.38,15.0,13.0,17.0,0 - 100,200 - 300,9
4,2021-12-01,Baydhaba,7.46,0.32,14.1,16.5,17.0,4200 - 4300,6500 - 6500,14


#### Useful information
The data is monthly from 2017 to 2021 (5 years). \
There are 74 districts in this dataset, even though, according to Google, Somalia is divided into 90 districts. \
The currency of Somalia is the Somali Shilling (SOS). As of 10-09-2022, 1 SOS is worth about 0.0018 euro (1 euro is 576.15 SOS). Somalia does not have a minimum wage.

### Risk factor feature engineering

In [479]:
# risk_factors_df_updated = risk_factors_df_updated.fillna(-1)
risk_factors_df_updated.head()

Unnamed: 0,date,district,rainfall,ndvi_score,Sorghum prices,Maize prices,Red Rice prices,Displacement (arrivals),Displacement (departures),Total alarms
0,2021-12-01,Mogadishu,0.27,0.22,9.95,13.95,13.5,56300 - 56400,9900 - 10000,12
1,2021-12-01,Afgooye,2.33,0.35,13.5,13.5,16.0,400 - 500,3000 - 3100,7
2,2021-12-01,Bossaso,2.79,0.18,48.0,,28.0,13700 - 13800,16200 - 16300,9
3,2021-12-01,Balcad,1.06,0.38,15.0,13.0,17.0,0 - 100,200 - 300,9
4,2021-12-01,Baydhaba,7.46,0.32,14.1,16.5,17.0,4200 - 4300,6500 - 6500,14


In [480]:
def risk_factors_df_updated_processing(risk_factors_df_updated):
    """Return the net population change (arrivals - departures) for each row."""
    population_change = []
    for i, row in risk_factors_df_updated.iterrows():
        # Prices: we may first need to fill NaN from previous data (same month and district) via groupby, then average.
        # Averaging is probably not a good idea: if the expensive crop is missing for a date and district,
        # the average drops, which is misleading.
        # prices = list(row[['Sorghum prices','Maize prices','Red Rice prices']])
        # prices = [float(x) for x in prices if x != -1]
        # avg_prices = np.average(prices)

        #----------------------------------------------
        # population change = people arrived - people departed, NaN if either value is unknown
        arrival = row['Displacement (arrivals)']
        departures = row['Displacement (departures)']

        try:
            # each value is a range such as '400 - 500'; use the midpoint
            a1, a2 = arrival.split('-')
            a = (int(a1) + int(a2)) // 2

            d1, d2 = departures.split('-')
            d = (int(d1) + int(d2)) // 2

            population_change.append(a - d)

        except (AttributeError, ValueError):
            # NaN (a float) has no .split; malformed ranges fail to unpack or convert
            population_change.append(np.nan)

    return population_change

In [481]:
population_change = risk_factors_df_updated_processing(risk_factors_df_updated)

In [482]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the column slice
risk_factors_df_updated = risk_factors_df_updated.assign(population_change=population_change)
risk_factors_df_updated = risk_factors_df_updated.sort_values('date')
risk_factors_df_updated.head()


Unnamed: 0,date,district,rainfall,ndvi_score,Sorghum prices,Maize prices,Red Rice prices,Displacement (arrivals),Displacement (departures),Total alarms,population_change
4439,2017-01-01,Jamaame,1.14,0.37,,9.0,14.0,0 - 100,0 - 100,4,0.0
4385,2017-01-01,Bandarbeyla,1.09,0.15,,,16.4,,,2,
4386,2017-01-01,Caluula,1.03,0.21,,,20.0,,,0,
4387,2017-01-01,Iskushuban,0.52,0.14,,,16.8,1000 - 1100,,3,
4388,2017-01-01,Qandala,0.49,0.17,,,,,,2,


In [483]:
missing_cols = ['district','ndvi_score', 'Sorghum prices', 'Maize prices', 'Red Rice prices', 'population_change']
missing_df_temp = risk_factors_df_updated[missing_cols].copy()
filled_df = fill_missing_vals(missing_df_temp)
risk_factors_df_updated[missing_cols] = filled_df

In [484]:
risk_factors_df_updated.head()

Unnamed: 0,date,district,rainfall,ndvi_score,Sorghum prices,Maize prices,Red Rice prices,Displacement (arrivals),Displacement (departures),Total alarms,population_change
4439,2017-01-01,Jamaame,1.14,0.37,0.0,9.0,14.0,0 - 100,0 - 100,4,0.0
4385,2017-01-01,Bandarbeyla,1.09,0.15,0.0,0.0,16.4,,,2,-200.0
4386,2017-01-01,Caluula,1.03,0.21,0.0,0.0,20.0,,,0,0.0
4387,2017-01-01,Iskushuban,0.52,0.14,0.0,0.0,16.8,1000 - 1100,,3,0.0
4388,2017-01-01,Qandala,0.49,0.17,0.0,0.0,0.0,,,2,0.0


In [485]:
risk_factors_df_updated.isnull().sum()*100/len(risk_factors_df_updated)

date                          0.000000
district                      0.000000
rainfall                      0.000000
ndvi_score                    0.000000
Sorghum prices                0.000000
Maize prices                  0.000000
Red Rice prices               0.000000
Displacement (arrivals)      33.423423
Displacement (departures)    11.576577
Total alarms                  0.000000
population_change             0.000000
dtype: float64

In [486]:
risk_factors_df_updated.columns

Index(['date', 'district', 'rainfall', 'ndvi_score', 'Sorghum prices',
       'Maize prices', 'Red Rice prices', 'Displacement (arrivals)',
       'Displacement (departures)', 'Total alarms', 'population_change'],
      dtype='object')

In [487]:
risk_factors_df_updated = risk_factors_df_updated[['date', 'district', 'rainfall', 'ndvi_score', 'Sorghum prices','Maize prices', 'Red Rice prices','Total alarms']]
risk_factors_df_updated.head()

Unnamed: 0,date,district,rainfall,ndvi_score,Sorghum prices,Maize prices,Red Rice prices,Total alarms
4439,2017-01-01,Jamaame,1.14,0.37,0.0,9.0,14.0,4
4385,2017-01-01,Bandarbeyla,1.09,0.15,0.0,0.0,16.4,2
4386,2017-01-01,Caluula,1.03,0.21,0.0,0.0,20.0,0
4387,2017-01-01,Iskushuban,0.52,0.14,0.0,0.0,16.8,3
4388,2017-01-01,Qandala,0.49,0.17,0.0,0.0,0.0,2


## Admissions

date: date of collection \
district: name of district \
MAM_admissions: Moderate Acute Malnutrition admissions \
SAM_admissions: Severe Acute Malnutrition admissions

In [488]:
admissions_df.head(5)

Unnamed: 0,date,district,MAM_admissions,SAM_admissions
0,2019-01-01,Afgooye,1547.0,903.0
1,2019-02-01,Afgooye,2306.0,997.0
2,2019-03-01,Afgooye,2299.0,485.0
3,2019-04-01,Afgooye,2928.0,274.0
4,2019-05-01,Afgooye,843.0,708.0


--------

**Features comment -**

1. Data for **66 districts** from 1-1-2019 to 1-11-2021, split into MAM and SAM, where GAM = MAM + SAM (we are predicting GAM).
2. We already have GAM data from 2017 to 2021 and GAM is the only target, so the admissions data adds no new information for the model.
3. It can still be used to study time series components (pattern, trend, seasonality).
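A crude first look at seasonality could be a per-calendar-month average of admissions. The sketch below uses made-up numbers and a hypothetical `GAM_admissions` column built as MAM + SAM; it is an illustration, not an analysis of the real data.

```python
import pandas as pd

# Toy admissions for one hypothetical district (values are invented)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-07-01", "2020-01-01", "2020-07-01"]),
    "district": ["A"] * 4,
    "MAM_admissions": [100.0, 300.0, 120.0, 340.0],
    "SAM_admissions": [20.0, 60.0, 40.0, 80.0],
})
toy["GAM_admissions"] = toy["MAM_admissions"] + toy["SAM_admissions"]

# Average admissions per calendar month, pooled over years: a simple seasonal profile
seasonal_profile = toy.groupby(toy["date"].dt.month)["GAM_admissions"].mean()
print(seasonal_profile.to_dict())  # {1: 140.0, 7: 390.0}
```

On the real data, the same grouping could be done per district, and a large gap between months would hint at a seasonal pattern worth modelling.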

### Useful Information

The data covers almost 3 years from Jan 2019 to Nov 2021

## Area Data

In [489]:
area_data_df.head(5)

Unnamed: 0,district,area,ruggedness,cropland_pct,pasture_pct
0,adan yabaal,3982.821,11680.92,16.09766,68.17391
1,afgooye,3962.774,12945.94,35.59959,71.17021
2,afmadow,26879.61,4109.924,19.18168,92.18092
3,baardheere,15343.18,34425.66,0.855787,95.43716
4,badhaadhe,9801.463,2724.839,60.39216,41.10714


Contains information about the size of different districts in Somalia, along with information about the soil in that district (ruggedness, usage in farming).
\
district: name of district \
area: size of land (km2) \
ruggedness: ruggedness index of the soil (m2) \
cropland_pct: percentage of land that is cropland \
pasture_pct: percentage of land that is pasture

--------

**Features comment -**

1. Data for **74 districts**
2. Static data that can be used for research/analysis, not for models
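If the static area data is joined to another table, note that district names here are lowercase (`adan yabaal`) while other tables use title case (`Adan Yabaal`). A minimal sketch of a join on a normalized key, with toy values and a hypothetical `district_key` column:

```python
import pandas as pd

# Toy frames; the numbers are illustrative only
area = pd.DataFrame({"district": ["adan yabaal", "afgooye"], "cropland_pct": [16.1, 35.6]})
monthly = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01"],
    "district": ["Adan Yabaal", "Afgooye"],
    "value": [3.0, 26.0],
})

# Build a normalized key so 'Adan Yabaal' and 'adan yabaal' match
area["district_key"] = area["district"].str.strip().str.lower()
monthly["district_key"] = monthly["district"].str.strip().str.lower()

merged = monthly.merge(area.drop(columns="district"), on="district_key", how="left")
print(merged[["district", "cropland_pct"]])
```

Spelling variants beyond letter case (for example different transliterations) would not be caught by this key and would need a manual mapping.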

## Conflict

In [520]:
conflict_df.head()

Unnamed: 0,date,district,n_battles,n_explosions,n_protests,n_riots,n_strategicdev,n_violcivilians,n_conflict_total
0,2019-01-01,Adan Yabaal,1.0,,,,2.0,,3.0
1,2019-04-01,Adan Yabaal,1.0,,,,,2.0,3.0
2,2019-05-01,Adan Yabaal,1.0,,,,1.0,,2.0
3,2019-06-01,Adan Yabaal,1.0,,,,,,1.0
4,2019-07-01,Adan Yabaal,,,,,,3.0,3.0


In [522]:
# conflict records for Afgooye district (last 12 rows)
conflict_df[conflict_df['district']=='Afgooye'].tail(12)

Unnamed: 0,date,district,n_battles,n_explosions,n_protests,n_riots,n_strategicdev,n_violcivilians,n_conflict_total
65,2020-11-01,Afgooye,21.0,4.0,,,,1.0,26.0
66,2020-12-01,Afgooye,31.0,1.0,,,,1.0,33.0
67,2021-01-01,Afgooye,29.0,5.0,,,,1.0,35.0
68,2021-02-01,Afgooye,29.0,2.0,,,,4.0,35.0
69,2021-03-01,Afgooye,16.0,3.0,,,1.0,1.0,21.0
70,2021-04-01,Afgooye,21.0,3.0,,,,1.0,25.0
71,2021-05-01,Afgooye,24.0,2.0,,,,1.0,27.0
72,2021-06-01,Afgooye,22.0,3.0,,,,1.0,26.0
73,2021-07-01,Afgooye,16.0,4.0,,,1.0,4.0,25.0
74,2021-08-01,Afgooye,20.0,1.0,,,,1.0,22.0


The total number of conflicts per district is the sum of several conflict types, such as armed battles, explosions, protests, and riots. It is recorded per month over different periods of time.

date: date of collection \
district: name of district \
n_battles: number of battles recorded on current date \
n_explosions: number of explosions \
n_protests: number of protests \
n_riots: number of riots \
n_strategicdev: number of strategic developments (see note) \
n_violcivilians: number of incidents of violence against civilians \
n_conflict_total: total number of conflicts

Note: strategic developments include incidents of looting, peace talks, high-profile arrests, non-violent transfers of territory, recruitment into non-state groups, etc.

In [491]:
conflict_df_missing = conflict_df.isnull().sum()*100/len(conflict_df)
conflict_df_missing

date                 0.000000
district             0.000000
n_battles           31.181928
n_explosions        56.151674
n_protests          84.509883
n_riots             94.070190
n_strategicdev      78.902783
n_violcivilians     53.691004
n_conflict_total     0.000000
dtype: float64

--------

**Features comment -**

1. Most values in the per-type columns are missing; the total number of conflicts is complete and seems the right column to use.
2. NaN is assumed to mean 0 here, as the total conflict count is consistent with that reading.

**If NaN instead means missing, we could interpolate the number of battles and recompute the total conflicts.**
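The "NaN means 0" reading can be checked by recomputing the total from the component columns. The sketch below reuses the first two rows of the table shown above; on the full data the same comparison would be run on `conflict_df`.

```python
import numpy as np
import pandas as pd

# First two rows of the conflict table shown above
toy = pd.DataFrame({
    "n_battles": [1.0, 1.0],
    "n_explosions": [np.nan, np.nan],
    "n_protests": [np.nan, np.nan],
    "n_riots": [np.nan, np.nan],
    "n_strategicdev": [2.0, np.nan],
    "n_violcivilians": [np.nan, 2.0],
    "n_conflict_total": [3.0, 3.0],
})

component_cols = [c for c in toy.columns if c != "n_conflict_total"]
recomputed = toy[component_cols].fillna(0).sum(axis=1)

# True where the reported total equals the sum of components with NaN treated as 0
consistent = np.isclose(recomputed, toy["n_conflict_total"])
print(consistent.all())  # True
```

Rows where the check fails would be the ones to inspect before deciding between "NaN is 0" and "NaN is missing".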

In [492]:
conflict_df.columns

Index(['date', 'district', 'n_battles', 'n_explosions', 'n_protests',
       'n_riots', 'n_strategicdev', 'n_violcivilians', 'n_conflict_total'],
      dtype='object')

In [493]:
conflict_df_upated = conflict_df[['date', 'district', 'n_conflict_total']]
conflict_df_upated

Unnamed: 0,date,district,n_conflict_total
0,2019-01-01,Adan Yabaal,3.0
1,2019-04-01,Adan Yabaal,3.0
2,2019-05-01,Adan Yabaal,2.0
3,2019-06-01,Adan Yabaal,1.0
4,2019-07-01,Adan Yabaal,3.0
...,...,...,...
2474,2017-06-01,Zeylac,1.0
2475,2018-02-01,Zeylac,1.0
2476,2018-09-01,Zeylac,1.0
2477,2018-10-01,Zeylac,1.0


## Covid

In [494]:
covid_df.head(5)

Unnamed: 0,date,new_cases,new_deaths
0,2020-03-01,5.0,0.0
1,2020-04-01,596.0,28.0
2,2020-05-01,1375.0,50.0
3,2020-06-01,948.0,12.0
4,2020-07-01,288.0,3.0


The number of covid cases and deaths caused by covid on a national level.

date: date of collection \
new_cases: new cases recorded on current date \
new_deaths: new covid related deaths \

--------

**Features comment -**

1. Covid cases from 1-3-2020 to 1-1-2022
2. Can be used directly

**Note -** 
Should we use covid cases and other variables that are constant for the whole country (GDP, population growth, etc.) in a new model to understand their effect?
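Because the covid series is national, using it per district means repeating the same monthly value for every district. A minimal sketch with toy district rows; filling pre-pandemic months with 0 is an assumption, not something the source data states.

```python
import pandas as pd

covid = pd.DataFrame({"date": ["2020-03-01", "2020-04-01"], "new_cases": [5.0, 596.0]})

# Hypothetical district-month rows
district_rows = pd.DataFrame({
    "date": ["2020-02-01", "2020-03-01", "2020-03-01", "2020-04-01"],
    "district": ["A", "A", "B", "A"],
})

merged = district_rows.merge(covid, on="date", how="left")
# Months before the first recorded case have no match; treating them as 0 cases is an assumption
merged["new_cases"] = merged["new_cases"].fillna(0)
print(merged["new_cases"].tolist())  # [0.0, 5.0, 5.0, 596.0]
```

A country-level variable like this is identical across districts within a month, so it can only help explain variation over time, not differences between districts.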

## IPC

In [495]:
ipc_df.head(5)

Unnamed: 0,date,area,level1_name,population_analysed,area_phase,phase1_n,phase1_perc,phase2_n,phase2_perc,phase3_n,...,proj_phase2_n,proj_phase2_perc,proj_phase3_n,proj_phase3_perc,proj_phase4_n,proj_phase4_perc,proj_phase5_n,proj_phase5_perc,proj_phase3plus_n,proj_phase3plus_perc
0,2021-01-01,Baki,Awdal,99157,2,63957,0.65,21300,0.23,10000.0,...,25781,0.26,11899.0,0.12,4958.0,0.05,0.0,0.0,17000,0.17
1,2021-01-01,Borama,Awdal,453434,2,305534,0.67,97800,0.25,46200.0,...,116800,0.26,67300.0,0.15,5500.0,0.01,0.0,0.0,72800,0.16
2,2021-01-01,Lughaye,Awdal,99157,2,64457,0.65,19800,0.33,10400.0,...,24900,0.25,19800.0,0.2,12800.0,0.13,0.0,0.0,32600,0.33
3,2021-01-01,Zeylac,Awdal,72825,2,47425,0.65,14500,0.34,7500.0,...,18200,0.25,14200.0,0.19,9900.0,0.14,0.0,0.0,24100,0.33
4,2021-01-01,Ceel barde,Bakool,50827,2,41227,0.81,5300,0.14,4000.0,...,7300,0.14,4800.0,0.09,600.0,0.01,0.0,0.0,5400,0.1


Data on IPC for different districts. IPC (Integrated Food Security Phase Classification) is a method that integrates several sources of evidence into a single score on a five-point scale, where 1 indicates no food insecurity and 5 indicates famine.

date: date of collection \
area: name of the district \
level1_name: region of the district \
population_analysed: number of people analysed (may not reflect the whole population) \
area_phase: phase with the highest number of people \
phase_n variables (phase1_n, phase2_n, …): number of people in the phase, except for phase3plus_n, which represents the number of people in phases 3-5.

--------

**Features comment -**

1. Dates run from 1-1-2017 to 1-1-2021 for 74 districts.
2. The percentage of people in phases 3 to 5 (phase3plus_perc) is the right feature, as used in baseline.py: the data is a sample, so we need the proportion rather than the count. It is not specific to children under 5, but it still reflects the population, so we assume that if x% of the population is in phase 3+, x% of children are as well.
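The relation between the count and percentage columns, and the proportionality assumption for children, can be sketched as below. The first two columns reuse numbers from the rows shown above (pairing the analysed population with the projected phase 3+ count is an assumption), and the under-five population of 10,000 is purely hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "area": ["Baki", "Borama"],
    "population_analysed": [99157, 453434],
    "proj_phase3plus_n": [17000, 72800],
})

# Share of the analysed population in phase 3 or worse
toy["share_phase3plus"] = toy["proj_phase3plus_n"] / toy["population_analysed"]
print(toy["share_phase3plus"].round(2).tolist())  # [0.17, 0.16]

# Assumption from the text: the same share applies to children under 5
hypothetical_under_five = 10_000
toy["est_children_phase3plus"] = toy["share_phase3plus"] * hypothetical_under_five
print(toy["est_children_phase3plus"].round().tolist())
```

The rounded shares match the `proj_phase3plus_perc` values displayed for these two districts.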

In [496]:
ipc_df.columns

Index(['date', 'area', 'level1_name', 'population_analysed', 'area_phase',
       'phase1_n', 'phase1_perc', 'phase2_n', 'phase2_perc', 'phase3_n',
       'phase3_perc', 'phase4_n', 'phase4_perc', 'phase5_n', 'phase5_perc',
       'phase3plus_n', 'phase3plus_perc', 'proj_population_analyzed',
       'proj_area_phase', 'proj_analysis_period', 'proj_phase1_n',
       'proj_phase1_perc', 'proj_phase2_n', 'proj_phase2_perc',
       'proj_phase3_n', 'proj_phase3_perc', 'proj_phase4_n',
       'proj_phase4_perc', 'proj_phase5_n', 'proj_phase5_perc',
       'proj_phase3plus_n', 'proj_phase3plus_perc'],
      dtype='object')

In [497]:
ipc_df_updated = ipc_df[['date', 'area', 'level1_name', 'phase3plus_n', 'phase3plus_perc']]
ipc_df_updated.head()

Unnamed: 0,date,area,level1_name,phase3plus_n,phase3plus_perc
0,2021-01-01,Baki,Awdal,36688,0.37
1,2021-01-01,Borama,Awdal,50100,0.11
2,2021-01-01,Lughaye,Awdal,14900,0.15
3,2021-01-01,Zeylac,Awdal,10900,0.15
4,2021-01-01,Ceel barde,Bakool,4300,0.09


## Locations

In [498]:
locations_df.head(5)

Unnamed: 0,date,district,Average of centy,Average of centx
0,2016-02-01,Adan Yabaal,3.549436,46.54467
1,2016-03-01,Adan Yabaal,3.549436,46.54467
2,2016-04-01,Adan Yabaal,3.549436,46.54467
3,2016-05-01,Adan Yabaal,3.549436,46.54467
4,2016-06-01,Adan Yabaal,3.549436,46.54467


Location data containing information on the average x and y coordinates of the center of the districts.

Date: date of measurement \
District: name of district \
Average of centy: average latitude of centroids \
Average of centx: average longitude of centroids \

--------

**Features comment -**

1. just the location data

## Prevalence Estimates

In [499]:
prevalence_estimates_df.head(5)

Unnamed: 0.1,Unnamed: 0,date,district,total population,Under-Five Population,GAM,MAM,SAM,GAM Prevalence,SAM Prevalence,SAM/GAM ratio
0,0,2021-07-01,Adan Yabaal,,17190.0,4930.0,,710.0,0.286795,0.041303,0.144016
1,1,2021-07-01,Afgooye,,94444.6,43800.0,,8930.0,0.463764,0.094553,0.203881
2,2,2021-07-01,Afmadow,,46703.8,18290.0,,4150.0,0.391617,0.088858,0.2269
3,3,2021-07-01,Baardheere,,34453.4,13330.0,,2230.0,0.386899,0.064725,0.167292
4,4,2021-07-01,Badhaadhe,,14272.6,5790.0,,1330.0,0.405672,0.093186,0.229706


Data containing information on the total cases of wasting of under-five population.

Date: date of collection \
District: name of district \
Total population: population of district \
Under-five population: population under 5 of district \
GAM, MAM, SAM: number of people with Global Acute Malnutrition (MAM + SAM), Moderate Acute Malnutrition, and Severe Acute Malnutrition, newly admitted for treatment \
GAM prevalence: percentage of children <5 with GAM \
SAM prevalence: percentage of children <5 with SAM \
SAM/GAM ratio: ratio of SAM/GAM 

--------

**Features comment -**

1. This is the target variable dataframe; we will predict GAM.
2. The under-five population is dynamic (it changes over time), which is correct: the population is not fixed.
3. After predicting GAM, the final conclusions should also check whether GAM actually increased relative to the population. A district's population may have grown by a large amount while GAM grew by only 1000, which may be small compared with the population growth of other districts.
4. Data is from 1-7-2017 to 1-7-2021 for 87 districts (the maximum range of dates and districts available for analysis).
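The point about counts versus population can be illustrated with prevalence = GAM / under-five population. The first frame reuses two rows from the table above; the second frame is an invented example in which the GAM count rises while prevalence falls.

```python
import pandas as pd

# Two rows from the prevalence table shown above
toy = pd.DataFrame({
    "district": ["Adan Yabaal", "Afgooye"],
    "Under-Five Population": [17190.0, 94444.6],
    "GAM": [4930.0, 43800.0],
})
toy["gam_prevalence_check"] = toy["GAM"] / toy["Under-Five Population"]
print(toy["gam_prevalence_check"].round(6).tolist())

# Invented example: the count grows by 1000 but the population grows faster
toy2 = pd.DataFrame({
    "period": ["t0", "t1"],
    "Under-Five Population": [10000.0, 15000.0],
    "GAM": [3000.0, 4000.0],
})
toy2["prevalence"] = toy2["GAM"] / toy2["Under-Five Population"]
print(toy2[["period", "GAM", "prevalence"]])
```

In the second frame the count increases while the prevalence drops from 0.30 to about 0.27, which is why conclusions should be stated in terms of prevalence as well as counts.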

In [500]:
prevalence_estimates_df.columns

Index(['Unnamed: 0', 'date', 'district', 'total population',
       'Under-Five Population', 'GAM', 'MAM', 'SAM', 'GAM Prevalence',
       'SAM Prevalence', 'SAM/GAM ratio'],
      dtype='object')

In [501]:
prevalence_estimates_df_updated = prevalence_estimates_df[['date', 'district',
       'Under-Five Population', 'GAM', 'GAM Prevalence']]
prevalence_estimates_df_updated.head()

Unnamed: 0,date,district,Under-Five Population,GAM,GAM Prevalence
0,2021-07-01,Adan Yabaal,17190.0,4930.0,0.286795
1,2021-07-01,Afgooye,94444.6,43800.0,0.463764
2,2021-07-01,Afmadow,46703.8,18290.0,0.391617
3,2021-07-01,Baardheere,34453.4,13330.0,0.386899
4,2021-07-01,Badhaadhe,14272.6,5790.0,0.405672


## Productions

In [502]:
productions_df.head(5)

Unnamed: 0,date,district,Cowpea,Ground Nuts,Maize,Onion,Pepper,Rice,Sesame,Sorghum,Tomato,Water Melon
0,1995-07-01,Adan Yabaal,,,,,,,,220.0,,
1,1995-07-01,Afgooye,,,16.381,,,,,409.0,,
2,1995-07-01,Afmadow,,,,,,,,,,
3,1995-07-01,Baardheere,,,138.0,,,,,5.18,,
4,1995-07-01,Badhaadhe,,,,,,,,,,


Dataset containing information on the production of different crops per district. Measurements are taken every 6 months (1st of January, 1st of July).

Date: date of collection \
District: name of district \
Crops (Maize, Cowpea, Ground Nuts, etc.): total number of tons produced in the district

In [511]:
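# Percent missing per column, relative to this frame's own row count (the output below was generated with len(conflict_df))
productions_df_missing = productions_df.isnull().sum()*100/len(productions_df)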
productions_df_missing

date            0.000000
district        0.000000
Cowpea         65.913675
Ground Nuts    96.167810
Maize          30.778540
Onion          91.286809
Pepper         95.804760
Rice           97.377975
Sesame         74.788221
Sorghum        40.701896
Tomato         91.851553
Water Melon    95.482049
cropdiv         0.000000
dtype: float64

In [512]:
productions_df[(productions_df['district']=='Afgooye') & (productions_df['date']=='2007-07-01')].count(axis=1)

1105    5
dtype: int64

--------

**Features comment -**

1. Data covers only 46 districts from 1-7-2017 to 1-07-2021. Is it worth joining? If we join, only those 46 districts remain.
2. Values are mostly missing, so total production cannot be determined.
3. The baseline model treats a NaN for a given date and district as "crop not produced" and counts the number of crops produced on that date in that district. This seems the right choice and the only practical way to use the data.
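Before deciding whether the join is worth it, the district overlap between two tables can be inspected with an outer merge and pandas' `indicator` option. The district lists below are toy placeholders, not the real district sets.

```python
import pandas as pd

# Hypothetical district lists for the two tables
prevalence_districts = pd.DataFrame({"district": ["A", "B", "C"]})
production_districts = pd.DataFrame({"district": ["A", "C", "D"]})

overlap = prevalence_districts.merge(
    production_districts, on="district", how="outer", indicator=True
)
counts = overlap["_merge"].value_counts()

# both = districts kept by an inner join; left_only = districts an inner join would drop
print(counts["both"], counts["left_only"], counts["right_only"])  # 2 1 1
```

A left join would keep every prevalence district instead; whether a missing production record should then count as zero crop diversity is a modelling decision, not something the data confirms.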

In [513]:
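# crop diversity = number of crops with a recorded value; selecting crop columns by name keeps the cell safe to re-run
crop_cols = ['Cowpea', 'Ground Nuts', 'Maize', 'Onion', 'Pepper', 'Rice', 'Sesame', 'Sorghum', 'Tomato', 'Water Melon']
productions_df['cropdiv'] = productions_df[crop_cols].count(axis=1)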
productions_df.head()

Unnamed: 0,date,district,Cowpea,Ground Nuts,Maize,Onion,Pepper,Rice,Sesame,Sorghum,Tomato,Water Melon,cropdiv
0,1995-07-01,Adan Yabaal,,,,,,,,220.0,,,1
1,1995-07-01,Afgooye,,,16.381,,,,,409.0,,,2
2,1995-07-01,Afmadow,,,,,,,,,,,0
3,1995-07-01,Baardheere,,,138.0,,,,,5.18,,,2
4,1995-07-01,Badhaadhe,,,,,,,,,,,0


In [514]:
productions_df.columns

Index(['date', 'district', 'Cowpea', 'Ground Nuts', 'Maize', 'Onion', 'Pepper',
       'Rice', 'Sesame', 'Sorghum', 'Tomato', 'Water Melon', 'cropdiv'],
      dtype='object')

In [515]:
productions_df_updated = productions_df[['date', 'district','cropdiv']]
productions_df_updated.head()

Unnamed: 0,date,district,cropdiv
0,1995-07-01,Adan Yabaal,1
1,1995-07-01,Afgooye,2
2,1995-07-01,Afmadow,0
3,1995-07-01,Baardheere,2
4,1995-07-01,Badhaadhe,0
