## We Need to Talk + MIT Code for Good '22
This notebook reads from the [data spreadsheet](https://docs.google.com/spreadsheets/d/1_OsK5jXUoQP0JRrfKKwzCxPS936-Qp3fE6RgNR_a82I/edit#gid=2114958450) that our CFG team has gathered and uses a simple model to calculate period poverty scores for the 81 provinces of Turkey. As more data is obtained, the model and spreadsheet can be modified to accommodate it.

The following links are helpful for getting started with the Google Sheets API:
- https://developers.google.com/sheets/api/quickstart/python
- https://blog.coupler.io/python-to-google-sheets/

This notebook requires:
- pandas
- google-auth 2.3.3
- google-api-python-client 2.35.0
- google-api-core 2.4.0
- google-auth-oauthlib 0.4.6

In [35]:
from googleapiclient.discovery import build
from google.oauth2 import service_account
from googleapiclient.errors import HttpError
import pandas as pd
import json
import csv
from collections import defaultdict

In [8]:
# Scope to allow read/write to the service account's files
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
SERVICE_ACCOUNT_FILE = "we-need-to-talk-338617-1bfc415b1e1b.json"

CREDENTIALS = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE, scopes=SCOPES)
SPREADSHEET_ID = "1_OsK5jXUoQP0JRrfKKwzCxPS936-Qp3fE6RgNR_a82I"

In [9]:
# Try an example first

SHEET_RANGE = "Data!A1:M82"
try:
    service = build('sheets', 'v4', credentials=CREDENTIALS)

    # Call the Sheets API
    sheet = service.spreadsheets()
    result = sheet.values().get(spreadsheetId=SPREADSHEET_ID, range=SHEET_RANGE).execute()
    values = result.get('values', [])

    if not values:
        raise Exception('No data found.')

except HttpError as err:
    print(err)

In [10]:
df = pd.DataFrame(values[1:], columns=values[0])
df

Unnamed: 0,Province Name,Region,Phone Prefix,Population (2019-2020 Estimate),Number of Menstruators (Estimate),Number of Refugee Menstruators (Estimate),Period Poverty Score
0,Adana,Mediterranean,322,2237940,680581,72648,4.72
1,Adıyaman,Southeastern Anatolia,416,626465,190515,6496,5.47
2,Afyonkarahisar,Aegean,272,729483,221843,3532,2.90
3,Ağrı,Eastern Anatolia,472,536199,163064,357,5.85
4,Aksaray,Central Anatolia,382,416567,126682,1121,3.96
...,...,...,...,...,...,...,...
76,Uşak,Aegean,276,370509,112676,889,3.22
77,Van,Eastern Anatolia,432,1136757,345700,620,6.97
78,Yalova,Marmara,226,270976,82407,1116,1.92
79,Yozgat,Central Anatolia,354,421200,128091,1499,3.45
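
As the `float(...)` calls later in this notebook suggest, the values read from the sheet arrive as strings, so numeric columns need converting before any arithmetic or sorting on `df`. The following is a minimal sketch of that conversion using three rows copied from the table above; `sample_values`, `sample_df`, and the choice of `numeric_cols` are illustrative names and are not part of the notebook.

```python
import pandas as pd

# Rows copied from the table above, kept as strings the way they are read from the sheet
sample_values = [
    ["Province Name", "Region", "Population (2019-2020 Estimate)", "Period Poverty Score"],
    ["Adana", "Mediterranean", "2237940", "4.72"],
    ["Van", "Eastern Anatolia", "1136757", "6.97"],
    ["Yalova", "Marmara", "270976", "1.92"],
]
sample_df = pd.DataFrame(sample_values[1:], columns=sample_values[0])

numeric_cols = ["Population (2019-2020 Estimate)", "Period Poverty Score"]
# errors="coerce" turns cells that cannot be parsed into NaN instead of raising
sample_df[numeric_cols] = sample_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Sorting now compares numbers rather than strings
highest = sample_df.sort_values("Period Poverty Score", ascending=False).iloc[0]
print(highest["Province Name"])
```

Without the conversion, sorting would compare the scores as text, which gives wrong orderings once values differ in length.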


In [33]:
MONTHLY_MENSTRUAL_COSTS = 200  # Assuming the purchase of pads, units in Turkish Liras (TRY)
AVG_NUM_FEMALES_PER_HOUSEHOLD = 2  # TODO: Remove this + alter model code below if using per person vs per household

def setup_sheets_api_client(creds):
    """
    Returns the sheets api client. 
    The client object can be called as follows to read from a spreadsheet:
    
    client.values().get(spreadsheetId=xyzid, range=xyzrange).execute()
    
    Raises HttpError if the connection fails.
    """
    service = build('sheets', 'v4', credentials=creds)
    return service.spreadsheets()

def add_sheet_data_to_dict(sheets_api_client, sheet_range, data_dict):
    """
    Uses sheets api client to read a specified range from the data spreadsheet
    (Id can be found in the URL: https://docs.google.com/spreadsheets/d/<ID HERE>/edit#gid=blah).
    Adds the data spanning the range to data_dict in the format: 
    {
        Adana: {
            "Region": "Mediterranean", "Population": 200, ...
        },
        Istanbul: {
            "Region": "Marmara", ...
        }, ...
    }
    
    Raises Exception if no data is found, or HttpError if the connection fails.
    """
    result = sheets_api_client.values().get(spreadsheetId=SPREADSHEET_ID, range=sheet_range).execute()
    values = result.get('values', [])

    if not values:
        raise Exception('No data found.')

    df = pd.DataFrame(values[1:], columns=values[0])
    
    for _, row in df.iterrows():
        province = row["Province Name"]
        for col in df.columns:
            data_dict[province][col] = row[col]
            
def calculate_period_poverty_per_province(data_dict, c1=2.5, c2=1, c3=4, c4=2.5):
    """
    Calculates and stores period poverty score by province in data_dict.
    Tentative formula:
    
    c1 * di + c2 * sr + c3 * pu + c4 * hc
    where (all values are calculated estimates or data taken from reputable organizations):
        - c1, c2, c3, and c4 are tunable coefficients
        - di := distress index (HDI) for females
        - sr := Syrian refugee percentage
        - pu := period unaffordability index (calculated via monthly menstrual expenditures vs monthly income)
        - hc := lack of healthcare index (calculated via availability of hospitals and staff)
        
    Default values for c1, c2, c3, c4 chosen arbitrarily. These can and should be adjusted.
    """
    for province in data_dict:
        di = float(data_dict[province]["Distress Index Females"])
        sr = float(data_dict[province]["Syrian Refugees (%)"].replace("%", "")) / 100  # Remove percent sign
        pu = float(data_dict[province]["Period Unaffordability Index"])
        hc = float(data_dict[province]['Lack of Healthcare Index'])
        
        data_dict[province]["Period Poverty Score"] = c1 * di + c2 * sr + c3 * pu + c4 * hc
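
To make the tentative formula concrete, here is a small worked example for a single province. The input values are hypothetical and are not taken from the spreadsheet; `period_poverty_score` is an illustrative helper that only restates the weighted sum with the default coefficients from `calculate_period_poverty_per_province`.

```python
def period_poverty_score(di, sr, pu, hc, c1=2.5, c2=1, c3=4, c4=2.5):
    """Weighted sum c1 * di + c2 * sr + c3 * pu + c4 * hc."""
    return c1 * di + c2 * sr + c3 * pu + c4 * hc

# Hypothetical inputs for one province (not real data)
di = 0.4                                    # distress index
sr = float("3.2%".replace("%", "")) / 100   # percent string converted to a fraction
pu = 0.5                                    # period unaffordability index
hc = 0.6                                    # lack of healthcare index

score = period_poverty_score(di, sr, pu, hc)
# Terms: 2.5 * 0.4 = 1.0, 1 * 0.032 = 0.032, 4 * 0.5 = 2.0, 2.5 * 0.6 = 1.5
print(round(score, 3))  # 4.532
```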

#### Notes:
- The Well-Being Index comes from the [Human Development Indices project](https://globaldatalab.org/shdi/2019/gender-development/TUR/?levels=1%2B4&interpolation=1&extrapolation=0&nearest_real=0) with its technical details explained [here](http://hdr.undp.org/sites/default/files/hdr2020_technical_notes.pdf).
- Mean income data taken from [here](https://data.tuik.gov.tr/Bulten/Index?p=Income-and-Living-Conditions-Survey-Regional-Results-2020-37405).
- See the Sources tab in our master spreadsheet for more details on how we obtained the data from international organizations' online public data.

In [109]:
a = len(df_survey[(df_survey["Where are currently residing?"] == '') & (df_survey["Do you menstruate?"] == "Yes")])
print(f"Number of menstruators who declined to say where they were: {a}. Discarding their survey results.")

Number of menstruators who declined to say where they were: 10. Discarding their survey results.


In [123]:
"""
Survey-related data collection. The survey-based scores are combined with the
period poverty scores (PPS) calculated from public online data only.
"""

def construct_df_survey():
    with open("survey.csv", "r") as f:
        reader = csv.reader(f)
        next(reader)  # Skip the first row; the next row supplies the column names
        df_survey = pd.DataFrame(reader, columns=next(reader))
    
    return df_survey

def construct_survey_map(df_survey):
    survey_map = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for _, row in df_survey.iterrows():
        menstruator = row["Do you menstruate?"] == "Yes"
        # Skip non-menstruators and menstruators who did not say where they reside
        if not menstruator or not row["Where are currently residing?"]:
            continue

        province = row["Where are currently residing?"]
        relevant_cols = [
            "How often do you have an access to soap?",
            "How often do you have an access to clean water?",
            "How often do you have an access to trash bin?",
            "How often do you have an access to a safe/private toilet?",
            "How often do you have an access to a clean toilet paper?",
            "How often do you have an access to a health facility? ",
            "On average, how much money do you spend on menstrual products per month?",
            "Do you experience financial difficulty in purchasing menstrual products?",
            "Have you had a vaginal infection (itching in the vagina, intense discharge, odor, burning during urination, painful sexual intercourse, etc.) in the last year?",
            "How many times have you visited a healthcare facility for a vaginal infection in the past year?",
            "How much is your monthly income? (If you receive financial support from any institution, count that amount as income.)"
        ]

        for col in relevant_cols:
            if row[col]:  # Filter out none and empty string values for the no responses
                survey_map[province][col][row[col]] += 1
    
    return survey_map


def calculate_distress_index(question_stats):
    index = 0
    answer_scores = {
        'Always': 0,
        'Often': 0.25,
        'Rarely': 0.75,
        'Never': 1
    }
    
    questions = [
        'How often do you have an access to soap?',
        'How often do you have an access to clean water?',
        'How often do you have an access to trash bin?',
        'How often do you have an access to a safe/private toilet?',
        'How often do you have an access to a clean toilet paper?',
        'How often do you have an access to a health facility? '
    ]
    
    for q in questions:
        num_responses = sum(v for v in question_stats[q].values())
        for ans in question_stats[q]:
            index += answer_scores[ans] * question_stats[q][ans]/num_responses
            
    # TODO: Can make coefficients here to weight different Qs instead of taking the avg
    # ie weight access to trash cans lower since that's just super basic
    index /= len(questions)
    return index

def calculate_period_unaffordability_index(question_stats):
    spending = question_stats['On average, how much money do you spend on menstrual products per month?']
    income = question_stats['How much is your monthly income? (If you receive financial support from any institution, count that amount as income.)']
    financial_difficulty = question_stats['Do you experience financial difficulty in purchasing menstrual products?']
    
    total_spending_resp = sum(v for v in spending.values())
    # Ignore the 'Bilmiyorum' ("I don't know")/'Never' value since it's treated as 0 spending (total # resp factors it in)
    avg_spending = sum(
        float(amt) * num_resp / total_spending_resp for amt, num_resp in spending.items() if amt.isnumeric()
    )
    total_income_resp = sum(v for v in income.values())
    avg_income = sum(
        float(amt) * num_resp / total_income_resp for amt, num_resp in income.items() if amt.isnumeric()
    )
    income_exp = 1 if avg_income == 0 else 1 - avg_spending/avg_income 
    
    total_financial_difficulty_resp = sum(v for v in financial_difficulty.values())
    answer_scores = {
        'I always have difficulties': 0,
        'I often have difficulties': 0.25,
        'I rarely have difficulties': 0.75,
        'I never have difficulties': 1
    }
    avg_financial_difficulty = sum(
        answer_scores[ans] * num_resp / total_financial_difficulty_resp for ans, num_resp in financial_difficulty.items()
    )
    
    # People's understanding of their own financial strain is better than our hand-wavy estimates of income vs expenditure
    return 0.25 * income_exp + 0.75 * avg_financial_difficulty
    
def calculate_healthcare_need_index(question_stats):
    infections = question_stats['Have you had a vaginal infection (itching in the vagina, intense discharge, odor, burning during urination, painful sexual intercourse, etc.) in the last year?']
    hospital_visits = question_stats['How many times have you visited a healthcare facility for a vaginal infection in the past year?']
    
    infection_scores = {
        'I am not sure ': 0.3,  # Worrying since this more likely than not means that something's not right
        'No': 1,
        'Yes': 0
    }
    hospital_scores = {
        '0': 1,
        '1': 0.3,  # 1 hospital visit implies that the situation was already really bad
        '2': 0.2,
        '3': 0.1,
        '4+': 0
    }
    
    total_inf_resp = sum(v for v in infections.values())
    avg_infection_score = sum(infection_scores[ans] * num_resp / total_inf_resp for ans, num_resp in infections.items())
    
    total_hosp_resp = sum(v for v in hospital_visits.values())
    avg_hospital_visit_score = sum(
        hospital_scores[ans] * num_resp / total_hosp_resp for ans, num_resp in hospital_visits.items()
    )
    
    # TODO: might want to weight hospital visits more? loses out on those who won't go to the hospital though
    return 0.5 * avg_infection_score + 0.5 * avg_hospital_visit_score

def calculate_period_poverty_score_survey_data(df_survey, survey_map, data_dict, c1=2.5, c2=3.5, c3=4):
    MAX_RESPONDENTS = df_survey["Where are currently residing?"].value_counts().max()
    for province, question_stats in survey_map.items():
        num_resp = len(df_survey[df_survey["Where are currently residing?"] == province])
        di = calculate_distress_index(question_stats)
        pu = calculate_period_unaffordability_index(question_stats)
        hc = calculate_healthcare_need_index(question_stats)
    
        pps_from_survey = c1 * di + c2 * pu + c3 * hc
        data_dict[province]["Period Poverty Score (Survey Data)"] = pps_from_survey
        # TODO: Maybe 0.2 minimum contribution of survey results for a province is still too much
        # especially considering how some have only 1 response per province
        # Consider an exponential function here to determine the weight
        # (Intuition: <20 respondents shouldn't have too much of a weight but hundreds definitely should,
        # obviously wrt the population of the province too)
        data_dict[province]["Period Poverty Score (Combined Weighted)"] = \
            (0.8 - 0.3 * num_resp / MAX_RESPONDENTS) * data_dict[province]["Period Poverty Score"] \
            + (0.2 + 0.3 * num_resp / MAX_RESPONDENTS) * pps_from_survey
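
The weighting used for the combined score can be read as a linear blend whose survey share grows from 0.2 to 0.5 with the number of respondents. The sketch below restates it with two illustrative helpers, `survey_weight` and `combine_scores`, which are not part of the notebook; the scores and respondent counts are hypothetical.

```python
def survey_weight(num_resp, max_respondents):
    """Share given to the survey-based score: 0.2 with no responses, 0.5 at the maximum."""
    return 0.2 + 0.3 * num_resp / max_respondents

def combine_scores(public_score, survey_score, num_resp, max_respondents):
    """Blend the two scores; the public-data share is 1 - w = 0.8 - 0.3 * num_resp / max."""
    w = survey_weight(num_resp, max_respondents)
    return (1 - w) * public_score + w * survey_score

# Hypothetical province with a public-data score of 3.0 and a survey score of 5.0
for n in (1, 100, 1000):
    print(n, round(survey_weight(n, 1000), 4), round(combine_scores(3.0, 5.0, n, 1000), 4))
```

Even a single response gives the survey score a weight of roughly 0.2, which is the concern raised in the TODO comment above.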

In [124]:
if __name__ == "__main__":
    client = setup_sheets_api_client(CREDENTIALS)
    data_dict = defaultdict(dict)
    
    # Add more tabs and columns to this list to change the information processed by the model
    # Some sheets have specific row ranges since there's a summary set of cells at the bottom that we filter out
    relevant_sheet_tab_ranges = [
        "Data!A1:F82",
        "Distress Index!A:G",
        "Syrian Refugees Data!A1:E82",
        "Income/Expenditure!A1:K82",
        "Lack of Health Care!A1:J82"
    ]
    
    # Read in all relevant data into data_dict
    for sheet_range in relevant_sheet_tab_ranges:
        add_sheet_data_to_dict(client, sheet_range, data_dict)

    # Use our model to calculate the period poverty score per province
    calculate_period_poverty_per_province(data_dict)
    
    # ---------------- Survey contribution to the model ----------------
    df_survey = construct_df_survey()
    survey_map = construct_survey_map(df_survey)
    
    # Use survey data to also calculate the period poverty score per province and merge it with the above score
    calculate_period_poverty_score_survey_data(df_survey, survey_map, data_dict)
    
    # Save results into a json file
    with open("provinces_data.json", "w") as f:
        json.dump(data_dict, f)
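
Once the results are saved, the JSON file can be loaded back and ranked. The sketch below assumes the same province-to-fields layout as `data_dict`; it writes a small stand-in file to a temporary directory rather than touching `provinces_data.json`. The three combined scores are copied from this notebook's results, while the file name `provinces_data_sample.json` and the entry without a survey score are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

COMBINED = "Period Poverty Score (Combined Weighted)"

# Stand-in for provinces_data.json with the same province -> fields layout
sample = {
    "Adana": {"Region": "Mediterranean", COMBINED: 4.87580926618683},
    "Ankara": {"Region": "Central Anatolia", COMBINED: 2.476904346506729},
    "Batman": {"Region": "Southeastern Anatolia", COMBINED: 6.033289215616047},
    "Province without survey data": {"Region": "Hypothetical"},  # placeholder entry
}

path = os.path.join(tempfile.mkdtemp(), "provinces_data_sample.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps Turkish characters readable in the file (optional)
    json.dump(sample, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# One row per province; provinces lacking a combined score are dropped before ranking
ranking = (
    pd.DataFrame.from_dict(loaded, orient="index")
    .dropna(subset=[COMBINED])
    .sort_values(COMBINED, ascending=False)
)
print(ranking.index.tolist())
```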

In [67]:
# TODO: Might want to factor in those who receive financial aid into the income/expenditure index
support_col = "Do you (or your family) receive financial support from any institution?"
num_supported = (df_survey[support_col] == "Yes").sum()
print(f"{num_supported} out of {len(df_survey)} total respondents received financial support")

214 out of 2707 total respondents received financial support


In [70]:
survey_map

defaultdict(<function __main__.<lambda>()>,
            {'Ankara': defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'How often do you have an access to soap?': defaultdict(int,
                                      {'Always': 283,
                                       'Often': 60,
                                       'Rarely': 10,
                                       'Never': 1}),
                          'How often do you have an access to clean water?': defaultdict(int,
                                      {'Always': 292,
                                       'Often': 55,
                                       'Never': 2,
                                       'Rarely': 5}),
                          'How often do you have an access to trash bin?': defaultdict(int,
                                      {'Often': 114,
                                       'Always': 226,
                                       'Rarely': 12,
               

In [71]:
survey_map["Ankara"]["How often do you have an access to soap?"]

defaultdict(int, {'Always': 283, 'Often': 60, 'Rarely': 10, 'Never': 1})
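
These counts are what `calculate_distress_index` averages over. As a worked example, the sketch below computes the contribution of the soap question alone for Ankara, using the counts shown above and the same answer scores; `soap_counts` and `soap_term` are illustrative names.

```python
# Answer scores as defined in calculate_distress_index
answer_scores = {"Always": 0, "Often": 0.25, "Rarely": 0.75, "Never": 1}

# Ankara's responses to the soap question, copied from the output above
soap_counts = {"Always": 283, "Often": 60, "Rarely": 10, "Never": 1}

num_responses = sum(soap_counts.values())
# Response-weighted average score for this one question
soap_term = sum(answer_scores[ans] * n / num_responses for ans, n in soap_counts.items())
print(num_responses, round(soap_term, 4))  # 354 0.0664
```

The full index repeats this for each of the six access questions and divides the sum by six.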

In [122]:
[(province, data["Period Poverty Score"], data["Period Poverty Score (Survey Data)"], data["Period Poverty Score (Combined Weighted)"]) for province,data in data_dict.items() if "Period Poverty Score (Survey Data)" in data]

[('Adana', 4.7218, 5.357885608274598, 4.87580926618683),
 ('Adıyaman', 5.479, 4.9043125000000005, 5.36323916547278),
 ('Afyonkarahisar', 2.9071, 6.423074999999999, 3.615332213467049),
 ('Ağrı', 5.854800000000001, 6.052083333333334, 5.8943131948424075),
 ('Aksaray', 3.9719999999999995, 6.271795138888889, 4.434594896131805),
 ('Amasya', 4.0633, 5.540652777777778, 4.360040486310092),
 ('Ankara', 1.1656, 5.515839704665858, 2.476904346506729),
 ('Antalya', 2.484, 5.261100146141601, 3.0815937563760296),
 ('Ardahan', 4.3537, 3.54995, 4.192719699140401),
 ('Aydın', 2.9475, 5.246510714285714, 3.4165245354072864),
 ('Balıkesir', 3.4641, 5.502357142857143, 3.8799278182562427),
 ('Bartın', 3.3362, 4.819194444444444, 3.636623229863101),
 ('Batman', 6.0686, 5.893052083333333, 6.033289215616047),
 ('Bayburt', 4.9279, 6.37495, 5.217724627507163),
 ('Bilecik', 1.4929000000000001, 4.7162083333333324, 2.1412560028653296),
 ('Bingöl', 4.5539, 5.3945, 4.722983438395415),
 ('Bitlis', 6.0285, 4.8436666666666