## Objective:
Create a list of health facility types from the facility names in a health facility dataset.

# Process


### Step 1: 
* Make a copy of this script and add the name of your country to its name, e.g. type_dictionaries_MOZ. Then create a facility type table: the table is generated when you run this notebook correctly.


### Step 2:
* Go through each facility type in the facility type table:
Check the facility types in the generated facility type table and correct any spelling errors. Some facility types are abbreviated; write out the full name for each abbreviation. For example, write **HC** as **Health Center**. The facility type table also has an **"abbreviation"** column. Fill this column with the abbreviation of the facility type. For example, if the facility type is Health Center, the abbreviation should be HC.


### Step 3:

* Repeat steps 1 and 2 for each health facility dataset you have explored in previous landscaping exercises.

### Step 4:

* Combine all the facility type tables that you created for each health facility dataset into one Excel file. Upload that file to your country's Dictionaries folder in Google Drive.





# Part 1: Import Modules


The cell below imports required modules. Just run the cell without modifying it. It does not create any output. It runs very fast. Run the cell by hovering over the cell and clicking the icon in the top left. When it is finished, a green checkmark will appear to the left of the cell.


In [None]:
import pandas as pd
import re
from google.colab import files
import io


# Part 2: Provide required inputs

The cell below requires some inputs from you. Please provide required inputs as explained below and then click the run icon on the top left of the cell:


*   **country_name**: Name of the country.
*   **health_facility_table**: File name of the table for which you want to create a facility type table. In our case this is one of the health facility datasets you explored previously. Ideally, use the HF dataset that you modified with standardized column names. The file format should be csv or xlsx.
*   **word_frequency_table**: File name of the word frequency table that you created in the previous Python exercise.
*   **facility_name**: Name of the column in the input table (aka the health facility dataset) that holds the facility name. If you are using the HF dataset version with standardized column names, put "facility_name" here.





In [None]:
country_name="UGA"
health_facility_table="schools2.csv"
word_frequency_table="schools2_word_frequency.csv"
facility_name="name"


# Part 3: Upload input table

The cell below allows you to choose the input tables from your computer. After you run the cell, you should see a **choose files** option. Click on it and navigate your computer to find the files you want to read as **health_facility_table** (aka the health facility dataset) and **word_frequency_table** from the last exercise. You can choose both files at the same time. Depending on the file size and the speed of your internet connection, it may take some time for Python to read a table.
 


In [None]:
uploaded=files.upload()

Saving schools2.csv to schools2.csv
Saving schools2_word_frequency.csv to schools2_word_frequency.csv
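
Before moving on, it can help to confirm that both files named in Part 2 were actually uploaded. The sketch below assumes that `uploaded` behaves like a dictionary keyed by file name, which is how the later cells in this notebook use it. The helper `missing_uploads` and the stand-in dictionary `uploaded_example` are hypothetical and not part of the original notebook.

```python
def missing_uploads(uploaded, expected_names):
    """Return the expected file names that are not keys of `uploaded`."""
    return [name for name in expected_names if name not in uploaded]


# Stand-in for the dictionary returned by files.upload() (file name -> bytes).
uploaded_example = {"schools2.csv": b"name\nKampala HC II\n"}

missing = missing_uploads(
    uploaded_example, ["schools2.csv", "schools2_word_frequency.csv"]
)
print(missing)
```

In the notebook itself you would pass the real `uploaded` object together with `[health_facility_table, word_frequency_table]`. An empty list means both names match; otherwise, check the spelling of the file names entered in Part 2.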


# Part 4: Create facility type table

Run the cell below. No need for any modification. After processing is done, you should see a csv file downloaded to your computer. The file is going to be saved in your **downloads** directory. The output table will have the same name as the health facility dataset with type_dictionary at the end. For example, if the input table name is DRC_health_facility, the output table name is going to be DRC_health_facility_type_dictionary.

In [None]:
def preclean(df, input_variable, output_variable, remove_accent=False):
  '''
    Basic cleaning and standardization of a column (modifies df in place)
    :param df: dataframe
    :param input_variable: a column name to be cleaned/standardized
    :param output_variable: cleaned/standardized column name
    :param remove_accent: replace accented (e.g. French) characters with ASCII equivalents
    :return: dataframe
  '''
  # replace NAs with empty string '' and add a trailing space so that a
  # roman numeral at the end of a name is matched by the patterns below
  df[output_variable] = df[input_variable].fillna('').astype(str) + " "
  # remove accent marks
  if remove_accent:
      import unidecode  # third-party package, only needed when remove_accent=True
      df[output_variable] = [unidecode.unidecode(n) for n in df[output_variable]]
  # convert roman numerals to digits; the patterns use '|', so regex=True is required
  df[output_variable] = df[output_variable] \
      .str.replace(" III | Iii | iii ", " 3 ", regex=True) \
      .str.replace(" II | Ii | ii ", " 2 ", regex=True) \
      .str.replace(" I | i ", " 1 ", regex=True) \
      .str.replace(" IV | Iv | iv ", " 4 ", regex=True) \
      .str.replace('&', 'and', regex=False)
  # put spaces around runs of digits, e.g. "HC2" becomes "HC 2"
  df[output_variable] = df[output_variable].apply(lambda x: " ".join(re.split(r'(\d+)', x)))
  # replace every non-alphanumeric character with a space
  df[output_variable] = df[output_variable].map(lambda x: re.sub(r'[^a-zA-Z0-9]', ' ', x))
  # convert to title case and collapse repeated whitespace
  df[output_variable] = df[output_variable].str.title() \
      .str.replace(r'\s+', ' ', regex=True).str.strip()
  return df




def create_type_dict(df, input_var, word_frequency_table, admin_name):
  '''
  Build a table of facility types and how often each one occurs
  :param df: input df
  :param input_var: name of the field to extract type info from
  :param word_frequency_table: word frequency table whose "words" column lists words that refer to the type of facility
  :param admin_name: name of the admin. It can be a country name, district name or province name
  :return: type table
  '''

  # read the key words from the word frequency table
  key_words = word_frequency_table["words"].str.strip().tolist()

  preclean(df, input_var, "clean_name")
  # words that are not key words make up the facility name
  for index, row in df.iterrows():
      name = df.at[index, "clean_name"].split()
      name_kept = [i for i in name if i not in key_words]
      df.at[index, "only_name"] = " ".join(name_kept)

  # the remaining words (the key words) make up the facility type
  for index, row in df.iterrows():
      only_name = df.at[index, "only_name"].split(" ")
      full_name = df.at[index, "clean_name"].split(" ")
      name_kept = [i for i in full_name if i not in only_name]
      df.at[index, "only_type"] = " ".join(name_kept)

  # count how often each facility type occurs
  type_count_df = df["only_type"].value_counts().rename_axis("type").reset_index(name="count")
  type_count_df["country"] = admin_name
  type_count_df["abbreviation"] = ""
  # put the columns in their final order
  type_count_df = type_count_df[["country", "type", "abbreviation", "count"]]
  return type_count_df


# read health facility table
if health_facility_table.endswith(".csv"):
  hf_df=pd.read_csv(io.BytesIO(uploaded[health_facility_table]))
elif health_facility_table.endswith(".xlsx"):
  hf_df=pd.read_excel(io.BytesIO(uploaded[health_facility_table]))
else:
  raise ValueError("health_facility_table must be a csv or xlsx file")

# read word frequency table
if word_frequency_table.endswith(".csv"):
  wf_df=pd.read_csv(io.BytesIO(uploaded[word_frequency_table]))
elif word_frequency_table.endswith(".xlsx"):
  wf_df=pd.read_excel(io.BytesIO(uploaded[word_frequency_table]))
else:
  raise ValueError("word_frequency_table must be a csv or xlsx file")

# create the facility type table
type_df=create_type_dict(hf_df,facility_name,wf_df, country_name)
# export type_df
output_table_name=health_facility_table.rsplit(".", 1)[0]+"_type_dictionary.csv"
type_df.to_csv(output_table_name, index=False)
files.download(output_table_name)
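
To make the logic of `create_type_dict` more concrete, here is a small stand-alone sketch in plain Python. The facility names and key words are made up for illustration, and `split_type` is a hypothetical helper that mirrors only the word-filtering step, not the cleaning done by `preclean`. Note that the names are compared after title-casing, so an abbreviation such as HC appears as "Hc" and would need to be listed in that form among the key words.

```python
from collections import Counter


def split_type(clean_name, key_words):
    """Split a cleaned facility name into (name part, type part)."""
    words = clean_name.split()
    # Words that are not key words belong to the facility name.
    name_part = " ".join(w for w in words if w not in key_words)
    # Words that are key words describe the facility type.
    type_part = " ".join(w for w in words if w in key_words)
    return name_part, type_part


# Made-up key words and already-cleaned (title-cased) facility names.
key_words = ["Health", "Center", "Hospital", "Hc"]
names = ["Kampala Health Center", "Gulu Hospital", "Mbale Health Center", "Lira Hc 2"]

# Count how often each facility type occurs, as the type table does.
type_counts = Counter(split_type(n, key_words)[1] for n in names)
print(type_counts.most_common())
```

Each distinct type string becomes one row of the facility type table, with its number of occurrences in the count column.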




# Part 5: Repeat for all datasets and combine into one file

Repeat all steps for each health facility dataset you have explored in previous landscaping exercises.

Combine all the facility type tables that you created for each health facility dataset into one Excel file. Upload that file to your country's Dictionaries folder in Google Drive.
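
If you prefer to combine the tables with pandas rather than by hand, one possible approach is sketched below. It assumes every type table has the columns country, type, abbreviation and count. The helper `combine_type_tables`, the added `source` column, the dataset labels and the output file name are all hypothetical choices for illustration.

```python
import pandas as pd


def combine_type_tables(tables):
    """Stack several facility type tables and tag each row with its source."""
    frames = [df.assign(source=name) for name, df in tables.items()]
    return pd.concat(frames, ignore_index=True)


# Two tiny made-up type tables standing in for the downloaded CSV files.
tables = {
    "dataset_a": pd.DataFrame(
        {"country": ["UGA"], "type": ["Health Center"], "abbreviation": ["HC"], "count": [12]}
    ),
    "dataset_b": pd.DataFrame(
        {"country": ["UGA"], "type": ["Hospital"], "abbreviation": [""], "count": [3]}
    ),
}

combined = combine_type_tables(tables)
print(combined)

# Writing the result to Excel requires an Excel engine such as openpyxl:
# combined.to_excel("UGA_type_dictionaries.xlsx", index=False)
```

With real data, each table could be loaded with `pd.read_csv` from the type dictionary files downloaded in Part 4. The `source` column keeps track of which dataset each facility type came from.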