# Primary Analysis

Run this notebook after the data-cleaning process has been completed.

## Info:
While preparing the data for presentation in the paper, I found the following improvements that could be made:
1. Add a table of participant characteristics, similar to Tables 2 and 3 in ref. [1].
    * The regression results are presented well, but an initial description of the study population is missing; the reference has some good tables for this. Tables 1 and 2 cover some of the details but do not give an overview of the population.

[1] Tian, X., Wu, M., Zang, J. et al. Dietary diversity and adiposity in Chinese men and women: an analysis of four waves of cross-sectional survey data. Eur J Clin Nutr 71, 506–511 (2017). https://doi.org/10.1038/ejcn.2016.212
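
As a minimal sketch of the kind of overview table described above, the snippet below cross-tabulates one categorical characteristic by fuel choice and adds an overall column. The toy data frame and the helper name `characteristics_table` are illustrative assumptions; they are neither the survey data nor the method used in [1].

```python
import pandas as pd

# Toy stand-in for the survey; the values are invented for illustration only.
toy = pd.DataFrame({
    "FuelChoice": ["LPG", "LPG", "LPG - Firewood", "LPG - Firewood", "LPG - Firewood"],
    "Gender": ["F", "M", "F", "F", "M"],
})

def characteristics_table(df, characteristic, group="FuelChoice"):
    """Percentage of each `characteristic` level within each `group`, plus the overall share."""
    # normalize="columns" makes the shares within each group sum to 1
    by_group = pd.crosstab(df[characteristic], df[group], normalize="columns")
    # Overall share across all rows, named so it becomes its own column
    overall = df[characteristic].value_counts(normalize=True).rename("Total Population")
    table = pd.concat([overall, by_group], axis=1)
    return (table * 100).round(1)

overview = characteristics_table(toy, "Gender")
print(overview)
```

Each column sums to 100%, so the groups can be compared directly even when their sizes differ.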


In [1]:
#Imports

import numpy as np
import pandas as pd
import warnings
import os
warnings.filterwarnings("ignore")

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import pandas_profiling as pp

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

data = pd.read_csv('cleaned_survey.csv')
data.sample(5)

Unnamed: 0.1,Unnamed: 0,Community name,Where do you normally cook?,What is the family's monthly income? (Brazil´s Minimum Wage R$937.00),Resident: Permanent,Gender (M or F),Age,Civil state,Education,House,How many meals a day?,. How many people eat at home daily?,How many contribute to the family's monthly income?,Which is the main one?,FuelChoice,Number_Men,Number_Women
9,9,Aruaú,built-in kitchen,1 MW,Y,M,46-55,Married,Incompleted elementary school,Own home,3.0,5.0,2.0,Agriculture,LPG,3,2
121,121,Santo Antônio,built-in kitchen,1 MW,Y,F,46-55,Married,Incompleted high school,Own home,3.0,3.0,2.0,,LPG - Charcoal - Firewood,1,2
146,146,Tiririca,built-in kitchen,1 MW,Y,F,46-55,Married,Incompleted elementary school,Own home,3.0,2.0,2.0,Agriculture,LPG - Charcoal - Firewood,1,1
54,54,Monte Sinai,Outdoor kitchen,,Y,F,37-45,Other union,Incompleted elementary school,Own home,2.0,10.0,,Agriculture,Charcoal - Firewood,7,3
37,37,Marajá,built-in kitchen,1-2 MW,Y,F,>56,Single,Incompleted elementary school,Own home,2.0,2.0,2.0,Retirement,LPG - Charcoal - Firewood,3,0
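
The sample above carries two leftover index columns (`Unnamed: 0.1` and `Unnamed: 0`), and the code in this notebook addresses columns by names that include stray trailing spaces. A minimal sketch of tidying both at load time follows. It uses an invented in-memory CSV, and the helper name `tidy_columns` is an assumption. The remaining cells refer to the untidied names, so applying this would mean updating those references as well.

```python
import io
import pandas as pd

# Invented two-row CSV mimicking the leftover index columns and untidy headers.
raw = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,How many meals a day? ,FuelChoice\n"
    "0,0,3.0,LPG\n"
    "1,1,2.0,LPG - Firewood\n"
)

def tidy_columns(df):
    """Drop leftover 'Unnamed: N' index columns and strip whitespace from headers."""
    keep = [c for c in df.columns if not str(c).startswith("Unnamed:")]
    out = df[keep].copy()
    out.columns = [str(c).strip() for c in out.columns]
    return out

clean = tidy_columns(pd.read_csv(raw))
print(list(clean.columns))
```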


In [2]:
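def mean_string(m, s):
    # Format "mean±std", rounding the mean to the decimal place of the
    # leading digit of the std; fall back to two decimals if std is 0 or NaN.
    if not np.isfinite(s) or s <= 0:
        return '{:.2f}±{:.2g}'.format(m, s)
    # max(0, ...) avoids a negative precision when std >= 10
    return '{:.{sig}f}±{:.2g}'.format(m, s, sig=max(0, -int(np.floor(np.log10(s)))))

def pct_string(p):
    # Format a fraction (0-1) as a percentage with one decimal place
    return '{:.1f}%'.format(p * 100.)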

#get fuel types
int_df = data.copy()
print("Fuel Choices by Count:\n",int_df['FuelChoice'].value_counts())

fuels = ["LPG - Firewood","LPG - Charcoal - Firewood","LPG - Charcoal","LPG"]
#Only taking the ones shown later in the multinomial analysis
numeric_cols = ['How many meals a day? ','. How many people eat at home daily? ', "How many contribute to the family's monthly income? ", 'Number_Men', 'Number_Women']
categorical = ['Where do you normally cook? ','Gender (M or F)',"Age",'Civil state','Education','Which is the main one?']

#Columns summarised by a single percentage, handled separately below:
# 'Resident:  Permanent ', 'House'

#Create empty DF
dict_template = {x:[] for x in ["Variable name","Total Population"]+fuels}
table_df = pd.DataFrame.from_dict(dict_template)

#Headers for Categorical
for cat in categorical:
    table_df.loc[cat,"Variable name"] = cat
    
for fuel in fuels:
    temp_df = int_df[int_df['FuelChoice']==fuel]
    for numeric in numeric_cols:
        table_df.loc["1_"+numeric,fuel] = mean_string(temp_df[numeric].mean(),
                                                     temp_df[numeric].std())
        #"%.2f ±%.3f"%(temp_df[numeric].mean(),temp_df[numeric].std())
        table_df.loc["1_"+numeric,"Variable name"] = numeric

    for cat in categorical:
        total = len(temp_df[cat].dropna())
        for unique_cat in int_df[cat].dropna().unique():
            count_cat = len(temp_df[temp_df[cat]==unique_cat][cat].dropna())
            table_df.loc[cat+"_"+str(unique_cat),fuel] = pct_string(count_cat/total)
            #"%.1f%s"%(count_cat/total*100.0,"%")
            table_df.loc[cat+"_"+str(unique_cat),"Variable name"] ="qquad "+str(unique_cat)
    
    table_df.loc['Resident:  Permanent ',fuel] = pct_string(temp_df['Resident:  Permanent '].dropna().count()/temp_df.shape[0])
    table_df.loc['Resident:  Permanent ',"Variable name"] = 'Resident:  Permanent '
    
    table_df.loc['House',fuel] = pct_string(temp_df[temp_df['House']=="Own home"]['House'].count()/temp_df['House'].dropna().count())
    table_df.loc['House',"Variable name"] = 'House'

#Total
for numeric in numeric_cols:
    table_df.loc["1_"+numeric,"Total Population"] = mean_string(int_df[numeric].mean(),
                                                               int_df[numeric].std())
    #"%.2f ±%.3f"%(int_df[numeric].mean(),int_df[numeric].std())

for cat in categorical:
    total = len(int_df[cat].dropna())
    for unique_cat in int_df[cat].dropna().unique():
        count_cat = len(int_df[int_df[cat]==unique_cat][cat].dropna())
        table_df.loc[cat+"_"+str(unique_cat),"Total Population"] = pct_string(count_cat/total)

table_df.loc['Resident:  Permanent ',"Total Population"] = pct_string(int_df['Resident:  Permanent '].dropna().count()/int_df.shape[0])
table_df.loc['Resident:  Permanent ',"Variable name"] = 'Resident:  Permanent '

table_df.loc['House',"Total Population"] = pct_string(int_df[int_df['House']=="Own home"]['House'].count()/int_df['House'].dropna().count())
table_df.loc['House',"Variable name"] = 'House'

table_df.sort_index(inplace=True)  

#Fix Names
change_name = {
 'Resident:  Permanent ' : "Residence",
 'Gender (M or F)': "Gender of Household Head",
 'Age' : "Age of Household Head",
 'Education': "Education of Household Head",
 'How many meals a day? ' : "Number of meals per day",
 '. How many people eat at home daily? ': "Number of people at meals daily",
 "How many contribute to the family's monthly income? ": "Number of people contributing to the monthly income",
 'Civil state':"Civil Status of Household Head",
 'House':"House Ownership",
  "Number_Men":"Number of Men in Household",
  "Number_Women":"Number of Women in Household",
  "Where do you normally cook? ":"Normal Cooking Area"
}

for change in change_name:
    table_df.loc[table_df['Variable name']==change,'Variable name']= change_name[change]

display(table_df)    

with pd.option_context("max_colwidth", 1000):
    string_tex = table_df.to_latex(header = False, multirow=True,index=False)

# add_header = """ 
# \\begin{tabular}{|p{5cm}|p{1.4cm}p{1.4cm}|p{1.4cm}p{1.4cm}|p{1.4cm}}
#                 \\toprule
# 				Variable name & Total Population & LPG - Firewood & LPG - Charcoal - Firewood & LPG - Charcoal & LPG \\\\
# 				\midrule
# """

add_header = """
\\begin{tabular}{p{4.5cm}ccccc}
                \\toprule
				Variable name & 
				\\multicolumn{1}{p{1.7cm}}{\\centering\\bfseries  Total Population }& 
				\\multicolumn{1}{p{1.7cm}}{\\centering\\bfseries  LPG - Firewood }& 
				\\multicolumn{1}{p{1.7cm}}{\\centering\\bfseries  LPG - Charcoal - Firewood }& 
				\\multicolumn{1}{p{1.7cm}}{\\centering\\bfseries  LPG - Charcoal} & 
				\\multicolumn{1}{p{1.7cm}}{\\centering\\bfseries LPG} \\\\
				\\midrule


"""
string_tex = string_tex.replace("qquad","\\qquad")

string_tex = add_header + string_tex.split("toprule")[1]

string_tex = string_tex.replace("±","$\\pm$")
string_tex = string_tex.replace("NaN"," ")

#Creates tex table that was used in the paper
with open("table_desc_format.tex", "wb") as text_file: 
    text_file.write(string_tex.encode('utf8')) 
        

Fuel Choices by Count:
 LPG - Charcoal - Firewood    84
LPG - Firewood               33
LPG                          32
LPG - Charcoal               16
Charcoal                      5
Nan                           4
Charcoal - Firewood           3
Firewood                      2
Name: FuelChoice, dtype: int64
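
The counts above include a literal `Nan` string category and three sparse fuel combinations that are not in the `fuels` list, so the four tabulated groups cover 165 of the 179 counted responses. A minimal sketch of making that filtering explicit is given below, using an invented series. Treating the `Nan` string as a missing answer is an assumption about how the cleaning step encoded blanks.

```python
import pandas as pd

fuels = ["LPG - Firewood", "LPG - Charcoal - Firewood", "LPG - Charcoal", "LPG"]

# Invented responses; 'Nan' is a string placeholder, not a real missing value.
choice = pd.Series(["LPG", "Nan", "Charcoal", "LPG - Firewood", "LPG", "Firewood"])

# Assumption: the 'Nan' string stands for a blank answer, so turn it into a real NaN.
choice = choice.where(choice != "Nan")

in_table = choice.isin(fuels)  # NaN is never matched by isin
summary = {
    "missing": int(choice.isna().sum()),
    "tabulated": int(in_table.sum()),
    "excluded": int((~in_table & choice.notna()).sum()),
}
print(summary)
```

Reporting the excluded and missing counts alongside the table makes it clear why the group sizes do not add up to the full sample.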


Unnamed: 0,Variable name,Total Population,LPG - Firewood,LPG - Charcoal - Firewood,LPG - Charcoal,LPG
1_. How many people eat at home daily?,Number of people at meals daily,4±2.5,5±3.4,4±2.2,3±1.7,3±1.8
1_How many contribute to the family's monthly income?,Number of people contributing to the monthly i...,1.8±0.8,2.0±0.98,1.8±0.82,1.8±0.68,1.6±0.62
1_How many meals a day?,Number of meals per day,3.3±0.83,3.2±0.79,3.4±0.77,3±1.3,3.2±0.75
1_Number_Men,Number of Men in Household,2±1.4,3±1.4,2±1.2,2±1.1,2±1.4
1_Number_Women,Number of Women in Household,2±1.1,2±1.1,2±1.2,2±1.4,1.5±0.98
Age,Age of Household Head,,,,,
Age_15-25,qquad 15-25,18.1%,12.1%,22.6%,18.8%,15.6%
Age_26-36,qquad 26-36,27.7%,12.1%,34.5%,31.2%,21.9%
Age_37-45,qquad 37-45,22.0%,42.4%,15.5%,18.8%,12.5%
Age_46-55,qquad 46-55,16.9%,12.1%,14.3%,6.2%,34.4%
