## Objective:
In a health facility table, the facility name and the facility type are usually written together in the 'facility name' column. For example, in "Makala Health Centre", "Makala" is the facility name and "Health Centre" is the facility type. We need to separate the name from the type and put them into two different columns. To do that, we first need to know which words refer to a type of facility.
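
To make the end goal concrete, here is a minimal sketch of how a reviewed list of type words could be used to split one value into two columns. The `TYPE_WORDS` set and the `split_name_and_type` helper are hypothetical and are not part of this notebook; the real list of type words comes from the word frequency tables produced below.

```python
# Hypothetical set of facility-type words (illustration only); in practice it
# would be built from the reviewed word frequency tables.
TYPE_WORDS = {"health", "centre", "center", "clinic", "hospital", "dispensary"}


def split_name_and_type(full_name, type_words=TYPE_WORDS):
    """Split a facility name into (name, type) using a set of type words."""
    name_parts, type_parts = [], []
    for word in full_name.split():
        # Compare case-insensitively so "Centre" and "CENTRE" are treated alike
        if word.lower() in type_words:
            type_parts.append(word)
        else:
            name_parts.append(word)
    return " ".join(name_parts), " ".join(type_parts)


name, facility_type = split_name_and_type("Makala Health Centre")
print(name)           # Makala
print(facility_type)  # Health Centre
```

A plain word lookup like this is only as good as the type-word list, which is why the review in Step 2 matters.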

# Process
This is a Python notebook that runs on Google Colab. A Python notebook consists of cells that contain Python scripts, and each cell can be run individually. To run a cell, click on or hover over it; an arrow appears on the left side of the cell. Click the arrow to run the cell. The cells must be run in order, from top to bottom.

This work consists of four steps. 

### Step 1: 
* Create a word frequency table:
A word frequency table is generated when you run this notebook from top to bottom.
The word frequency table lists each word and how many times it occurs in the dataset.
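
As an illustration of what such a table looks like, the sketch below counts words in three made-up facility names. It uses the same pandas calls as the notebook's own code (`str.split`, `stack`, `value_counts`); the sample names are invented for this example.

```python
import pandas as pd

# Made-up facility names, for illustration only
names = pd.Series(["Makala Health Centre", "Kintambo Hospital", "Ngaba Health Centre"])

# Split each name into words, stack all words into a single column,
# then count how often each word occurs
word_count = names.str.split(expand=True).stack().value_counts()

# Name the two columns explicitly
word_count_df = word_count.rename_axis("words").reset_index(name="frequency")
print(word_count_df)
```

Here "Health" and "Centre" each occur twice, while every place name occurs only once. Type words tend to repeat across facilities, which is what makes a frequency table useful for finding them.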

### Step 2:
* Go through each word in the word frequency table:
Open the word frequency table and keep only the words that refer to a facility type, such as **Centre, Clinic, Hospital, Health, HC, DISP, PHC ...**. In other words, delete every word that is part of a facility name rather than a facility type.

### Step 3:

* Repeat steps 1 and 2 for each health facility dataset you explored in previous landscaping exercises.

### Step 4:

* Combine all the word frequency tables that you created for each health facility dataset into one Excel file. Upload that file to your country's Health Facilities dataset folder in Google Drive.





# Part 1: Import Modules


The cell below imports required modules. Just run the cell without modifying it. It does not create any output. It runs very fast. Run the cell by hovering over the cell and clicking the icon in the top left. When it is finished, a green checkmark will appear to the left of the cell.


In [None]:
import pandas as pd
import re
from google.colab import files
import io

# Part 2: Provide required inputs

The cell below requires some inputs from you. Please provide required inputs as explained below and then click the run icon on the top left of the cell:


*   **country_name**: Name of the country
*   **input_table**: File name of the table for which you want to create a frequency table. In our case it is one of the health facility datasets you explored previously. Ideally, use the HF dataset that you modified to have standardized column names. The file format must be csv or xlsx
*   **frequency_threshold**: Minimum number of times a word must occur to be kept in the output table (the output table is the file that Python generates automatically). For large health facility lists (more than 5k facilities) use 10; for small lists (fewer than 5k facilities) use 5
*   **field_name**: Name of the column in the input table (i.e. the health facility dataset) that holds the facility name. If you are using the HF dataset version with standardized column names, put "facility_name" here
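
The effect of **frequency_threshold** can be sketched as follows. The `suggest_threshold` helper is hypothetical and only encodes the rule of thumb above; the word counts are made up. The rule does not say what to do with exactly 5k facilities, so treating that case as a small list is an assumption.

```python
import pandas as pd


def suggest_threshold(n_facilities):
    """Hypothetical helper encoding the rule of thumb: 10 for large lists, 5 for small ones."""
    # Exactly 5000 facilities is not covered by the rule; treated as a small list here (assumption)
    return 10 if n_facilities > 5000 else 5


# Made-up word counts, for illustration only
word_count_df = pd.DataFrame(
    {"words": ["Centre", "Health", "Makala"], "frequency": [12, 7, 1]}
)

threshold = suggest_threshold(n_facilities=1200)

# Keep only words that occur at least `threshold` times
kept = word_count_df[word_count_df["frequency"] >= threshold]
print(kept)
```

A higher threshold produces a shorter table to review, at the risk of dropping rare type words.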





In [None]:
country_name="DRC"
input_table="schools2.csv"
frequency_threshold=25
field_name="name"


# Part 3: Upload input table

The cell below lets you choose the input table from your computer. After you run the cell, a **choose files** option appears. Click **choose files** and browse your computer for the file you named in **input_table**, that is, the health facility dataset. Reading the table may take some time, depending on the file size and the speed of your internet connection.
 


In [None]:
uploaded=files.upload()

Saving schools2.csv to schools2.csv


# Part 4: Create word frequency table

Run the cell below without modifying it. When processing is done, a csv file is downloaded to your computer and saved in your **downloads** directory. The output table has the same name as the health facility dataset, with _word_frequency appended. For example, if the input table name is DRC_health_facility, the output table name is DRC_health_facility_word_frequency. Go through each word in the output table (column B) and keep only the rows with words that refer to a type of health facility. For example, delete a row with the word 'Makala', and keep any rows with words such as **Clinic, Hospital, Center, Dispensary, HS, PHC etc ...**. This table is going to be used to separate types from facility names.

In [None]:
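def preclean(df, input_variable, output_variable, remove_accent=False):
    '''
    Basic cleaning and standardization of a column
    :param df: dataframe (modified in place)
    :param input_variable: name of the column to be cleaned/standardized
    :param output_variable: name of the column that receives the cleaned/standardized values
    :param remove_accent: if True, replace accented (e.g. French) characters with ASCII equivalents
    :return: None; output_variable is added to df in place
    '''
    ##=============================================================================#
    # replace NAs with an empty string and pad with a trailing space so that
    # a roman numeral at the end of a name is matched by the patterns below
    df[output_variable] = df[input_variable].fillna('').astype(str) + " "
    # remove accent marks
    if remove_accent:
        import unidecode  # third-party package, only needed when remove_accent=True
        df[output_variable] = [unidecode.unidecode(n) for n in df[output_variable]]
    # convert roman numerals to digits; regex=True is required because the
    # patterns use "|" and recent pandas versions treat patterns as literal text by default
    df[output_variable] = df[output_variable] \
        .str.replace(" III | Iii | iii ", " 3 ", regex=True) \
        .str.replace(" II | Ii | ii ", " 2 ", regex=True) \
        .str.replace(" I | i ", " 1 ", regex=True) \
        .str.replace(" IV | Iv | iv ", " 4 ", regex=True) \
        .str.replace('&', 'and', regex=False)
    # put spaces around groups of digits, e.g. "Clinic2" -> "Clinic 2"
    df[output_variable] = df[output_variable].apply(lambda x: " ".join(re.split(r'(\d+)', x)))
    # replace every character that is not a letter or a digit with a space
    df[output_variable] = df[output_variable].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)
    # title-case the words and collapse repeated whitespace
    df[output_variable] = df[output_variable].str.title(). \
        str.replace(r'\s+', ' ', regex=True).str.strip()
    # replace NAs in output_variable with empty string ''
    df[output_variable] = df[output_variable].fillna('')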


def get_word_frequency(df, input_var, frequency_threshold=5):
    '''
    :param df: input df
    :param input_var: name of the field used to create the word frequency table
    :param frequency_threshold: minimum number of occurrences for a word to be included in the frequency table
    :return: dataframe with the columns "words" and "frequency"
    '''

    # preclean input_var
    preclean(df, input_var, "clean_name")
    # count how many times each word occurs across all cleaned names
    word_count = df["clean_name"].str.split(expand=True).stack().value_counts()
    # name both columns explicitly so the result does not depend on the pandas version
    word_count_df = word_count.rename_axis("words").reset_index(name="frequency")
    # keep only words that occur at least frequency_threshold times
    word_count_df = word_count_df[word_count_df["frequency"] >= frequency_threshold].copy()
    # remove the temporary column from the input df
    df.drop(["clean_name"], axis=1, inplace=True)
    return word_count_df

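if input_table.endswith(".csv"):
  df=pd.read_csv(io.BytesIO(uploaded[input_table]))
elif input_table.endswith(".xlsx"):
  df=pd.read_excel(io.BytesIO(uploaded[input_table]))
else:
  # stop early with a clear message instead of failing later with an undefined df
  raise ValueError("input_table must be a .csv or .xlsx file: " + input_table)

df_out=get_word_frequency(df,field_name,frequency_threshold=frequency_threshold )
df_out["country"]=country_name
# strip only the final extension so that file names containing dots are kept intact
output_table_name=input_table.rsplit(".", 1)[0]+"_word_frequency.csv"
df_out.to_csv(output_table_name)
files.download(output_table_name)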




# Part 5: Repeat for all datasets and combine into one file

Repeat all steps for each health facility dataset you explored in previous landscaping exercises.

Combine all the word frequency tables that you created for each health facility dataset into one Excel file. Upload that file to your country's Dictionaries folder in Google Drive.
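
If you prefer to combine the tables with Python rather than by hand, the sketch below shows one possible approach. It is not part of the original workflow: the `combine_frequency_tables` helper, the `source_table` column, and the file names are all hypothetical, and the small tables stand in for your reviewed word frequency files.

```python
import pandas as pd


def combine_frequency_tables(tables):
    """Stack several word frequency tables into one.

    tables: dict mapping a source file name (hypothetical) to its reviewed
    word frequency DataFrame.
    """
    # Record which file each row came from before stacking the tables
    frames = [table.assign(source_table=name) for name, table in tables.items()]
    return pd.concat(frames, ignore_index=True)


# Made-up reviewed tables, for illustration only
tables = {
    "list_a_word_frequency.csv": pd.DataFrame(
        {"words": ["Centre", "Hospital"], "frequency": [40, 12], "country": "DRC"}
    ),
    "list_b_word_frequency.csv": pd.DataFrame(
        {"words": ["Clinic", "Centre"], "frequency": [25, 9], "country": "DRC"}
    ),
}

combined = combine_frequency_tables(tables)
print(combined)

# To save the result as an Excel file (requires the openpyxl package):
# combined.to_excel("combined_word_frequency.xlsx", index=False)
```

With real files, each table could be loaded with `pd.read_csv` from the reviewed csv files before being passed to the helper.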