## Objective:
Create a list of misspelled health facility types extracted from the facility names in a health facility dataset.

# Process


### Step 1: 
* Make a copy of this script and add your country's code to the file name, e.g. spelling_dictionaries_MOZ. Then create a table of misspelled health facility types. This table is generated when you run this notebook successfully.

### Step 2:
* Go through each misspelled name in the misspelled type table:
We will use the misspelled type table in the next step to fix misspellings. For example, we will convert all of "hooital", "hodpital" and "hopitap" to "hospital". However, some words flagged as misspelled are not actually wrong. For example, suppose this notebook suggests that "sainte" should be spelled "sante" (French for 'health') because the two words are very similar. This suggestion is not accurate: 'Sainte' can mean 'Saint' (e.g. Sainte Marie), which is part of the facility name, not the type, so we should not convert 'sainte' to 'sante'. Find these cases and delete them from the misspelled type table.
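
The review in this step can also be done in pandas. The following is a minimal sketch, not part of the original notebook: the rows are illustrative, and the `false_positives` set is a hypothetical list that you would fill in yourself after inspecting the table. The column names `word` and `misspelling` match the table this notebook produces.

```python
import pandas as pd

# Illustrative rows only; the real table comes from running this notebook.
misspelled = pd.DataFrame({
    "word": ["hospital", "hospital", "sante"],
    "misspelling": ["hodpital", "hopitap", "sainte"],
})

# Hypothetical list of flagged words that are NOT misspellings
# (e.g. 'sainte' is part of a facility name, not a type).
false_positives = {"sainte"}

# Keep only the rows whose misspelling is not a known false positive.
reviewed = misspelled[~misspelled["misspelling"].isin(false_positives)]
reviewed = reviewed.reset_index(drop=True)
print(reviewed)
```

The same words could instead be passed as `skip_spellings` to `generate_misspellings` when re-running the notebook, so that they never enter the table.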


### Step 3:

* Repeat steps 1 and 2 for each health facility dataset you have explored in previous landscaping exercises 

### Step 4:

* Combine all the misspelled type tables that you created for each health facility dataset into one Excel file. Upload that file to your country's Dictionaries folder in Google Drive.





# Part 1 : Import Modules


The cell below installs some modules that do not exist in Python's default environment. You need to run it to make these external modules available in the notebook environment. Just run the cell without modifying it.

In [None]:
!pip install symspellpy
!pip install thefuzz

Collecting symspellpy
  Downloading symspellpy-6.7.6-py3-none-any.whl (2.6 MB)
Collecting editdistpy>=0.1.3
  Downloading editdistpy-0.1.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (125 kB)
Installing collected packages: editdistpy, symspellpy
Successfully installed editdistpy-0.1.3 symspellpy-6.7.6
Collecting thefuzz
  Downloading thefuzz-0.19.0-py2.py3-none-any.whl (17 kB)
Installing collected packages: thefuzz
Successfully installed thefuzz-0.19.0


The cell below imports required modules. Just run the cell without modifying it. It does not create any output. It runs very fast. Run the cell by hovering over the cell and clicking the icon in the top left. When it is finished, a green checkmark will appear to the left of the cell.


In [None]:
import pandas as pd
import re,os
from google.colab import files
import io
from thefuzz import process, fuzz
from symspellpy import SymSpell

# Part 2: Provide required inputs

The cell below requires some inputs from you. Provide the inputs as explained below, then click the run icon at the top left of the cell:


*   **country_name**: ISO code of the country (e.g. MWI, KEN, ZMB)
*   **health_facility_table**: File name of the table for which you want to create a misspelled type table. In our case it is one of the health facility datasets you explored previously. Ideally, use the version of the HF dataset that you modified to have standardized column names. The file format must be csv or xlsx.
*   **type_dictionary_table**: File name of the type dictionary table that you created in the previous Python exercise. It must contain a column named "type".
*   **facility_name**: Name of the column in the health facility dataset that holds the facility name. If you are using the HF dataset with standardized column names, put "facility_name" here.





In [None]:
country_name="UGA"
health_facility_table="Uganda_MFL_modified.xlsx"
type_dictionary_table="Uganda_Combined_Health Facility_word_frequency.xlsx"
facility_name="facility_name2"
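

Before uploading, it can help to sanity-check these inputs. The sketch below is an optional addition, not part of the original notebook; `check_inputs` is a hypothetical helper that only verifies the three-letter country code and the file extensions the notebook can read.

```python
import re

def check_inputs(country_name, health_facility_table, type_dictionary_table):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # ISO alpha-3 codes are three uppercase letters, e.g. UGA.
    if not re.fullmatch(r"[A-Z]{3}", country_name):
        problems.append("country_name should be a 3-letter ISO code, e.g. UGA")
    # The notebook only reads .csv and .xlsx files.
    for label, filename in (("health_facility_table", health_facility_table),
                            ("type_dictionary_table", type_dictionary_table)):
        if not filename.endswith((".csv", ".xlsx")):
            problems.append(label + " must be a .csv or .xlsx file")
    return problems

print(check_inputs("UGA", "Uganda_MFL_modified.xlsx",
                   "Uganda_Combined_Health Facility_word_frequency.xlsx"))
```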


# Part 3: Upload input table

The cell below allows you to choose the input tables from your computer. After you run it, you should see a **choose files** option. Click **choose files** and navigate to the files you want to read as **health_facility_table** (the health facility dataset) and **type_dictionary_table** (from the last exercise). You should **choose both files at the same time**, and the file names must match the names you entered in Part 2. Reading a table may take some time, depending on the file size and the speed of your internet connection.
 


In [None]:
uploaded=files.upload()

Saving test_health_facility.csv to test_health_facility (1).csv
Saving test_type_dict.csv to test_type_dict (1).csv


# Part 4: Create misspelled type table

Run the cell below. No modification is needed. When processing is done, a csv file is downloaded to your computer and saved in your **downloads** directory. The output table has the same name as the health facility dataset with _spelling_dictionary appended. For example, if the input table is DRC_health_facility.csv, the output table is DRC_health_facility_spelling_dictionary.csv.
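
Before searching for misspellings, the cell below normalizes each facility name. The following simplified sketch shows that cleaning on a single string. The helper name `clean_name` is hypothetical and handles only uppercase Roman numerals; the notebook's own function works on a whole dataframe column.

```python
import re

# Roman numerals (surrounded by spaces) and their digit replacements.
ROMAN_NUMERALS = [(" III ", " 3 "), (" II ", " 2 "), (" IV ", " 4 "), (" I ", " 1 ")]

def clean_name(name):
    # Pad with spaces so a numeral at the start or end of the name matches.
    text = " " + name + " "
    for numeral, digit in ROMAN_NUMERALS:
        text = text.replace(numeral, digit)
    text = text.replace("&", "and")
    # Separate digits from letters, e.g. "ward12" -> "ward 12".
    text = " ".join(re.split(r"(\d+)", text))
    # Replace anything that is not a letter or digit with a space.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Collapse repeated whitespace and title-case the result.
    return re.sub(r"\s+", " ", text).strip().title()

print(clean_name("Kawempe Health Centre III"))
print(clean_name("Mulago Hospital & Clinic II"))
```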

In [None]:
def preclean(df, input_variable, output_variable, remove_accent=False):
  '''
    Basic cleaning and standardization of a column
    :param df: dataframe (modified in place)
    :param input_variable: a column name to be cleaned/standardized
    :param output_variable: cleaned/standardized column name
    :param remove_accent: remove accent marks (e.g. French characters)
    :return: dataframe
  '''
  # replace NAs with empty string '' and add a trailing space so that a
  # Roman numeral at the end of a name matches the patterns below
  df[output_variable] = df[input_variable].fillna('').astype(str) + " "
  # remove accent marks
  if remove_accent:
      # imported here because unidecode is only needed for this option
      # (install it first with: pip install unidecode)
      import unidecode
      df[output_variable] = [unidecode.unidecode(n) for n in df[output_variable]]
  # convert Roman numerals to digits; regex=True is required because the
  # patterns use "|" for alternation
  df[output_variable] = df[output_variable] \
      .str.replace(" III | Iii | iii ", " 3 ", regex=True) \
      .str.replace(" II | Ii | ii ", " 2 ", regex=True) \
      .str.replace(" I | i ", " 1 ", regex=True) \
      .str.replace(" IV | Iv | iv ", " 4 ", regex=True) \
      .str.replace('&', 'and', regex=False)
  # separate digits from letters, e.g. "ward12" -> "ward 12"
  df[output_variable] = df[output_variable].apply(lambda x: " ".join(re.split(r'(\d+)', x)))
  # replace everything except letters and digits with spaces
  df[output_variable] = df[output_variable].map(lambda x: re.sub(r'[^a-zA-Z0-9]', ' ', x))
  # title-case and collapse repeated whitespace
  df[output_variable] = df[output_variable].str.title() \
      .str.replace(r'\s+', ' ', regex=True).str.strip()
  return df





def generate_misspellings(df,input_var,type_dict, admin_name,  skip_spellings=(), min_length=5):
    '''
    Find likely misspellings of health facility type keywords in facility names
    :param df:  input df
    :param input_var: Name of the field to extract type info
    :param type_dict: type dictionary; must have a 'type' column
    :param admin_name: name of the admin. It can be a country name, district name or province name
    :param skip_spellings: spellings to leave out of the result
    :param min_length: minimum length of a type keyword to be checked
    :return: misspelled type table
    '''



    type_dict=type_dict[type_dict.type.notnull()]
    type_keywords = ' '.join(list(type_dict['type'].str.lower())).split()

    # convert from list to set to remove repeating words, then convert to list again
    type_keywords_all = list(set(type_keywords))
    # keep only keywords with the minimum length
    type_keywords_to_check = [word for word in type_keywords_all if len(word) >= min_length]

    preclean(df, input_var, "clean_name")

    names = ' '.join(list(df[~pd.isna(df["clean_name"])]["clean_name"].str.lower())).split()
    columns = ['name', 'word', 'misspelling', 'frequency', 'score']
    rows = []
    for word in type_keywords_to_check:
        # keep just words that start with the same letter as the type keyword
        # and are longer than half of the length of the type keyword
        # also remove the words that already appear in type keywords
        start_char = word[0]  # first letter
        min_len = len(word) // 2  # minimum length requirement
        names_word = [name for name in names if name.startswith(start_char)
                      and len(name) > min_len and name not in type_keywords_all]

        # write the relevant words to a text file
        filename = word + ".txt"
        with open(filename, "w") as file1:
            file1.write(' '.join(names_word))

        # generate word frequency dictionary
        sym_spell = SymSpell()
        sym_spell.create_dictionary(filename)
        freq_dict = sym_spell.words
        # remove the text file
        os.remove(filename)

        # compute similarity score with respect to the original word
        threshold = (len(word) - 1) / len(word)  # score threshold
        for spelling, frequency in freq_dict.items():
            if spelling in skip_spellings:
                continue
            ratio = fuzz.ratio(spelling, word)
            if ratio / 100 >= threshold:
                rows.append([admin_name.upper(), word, spelling, frequency, ratio])

    # build the table once; it keeps its column headers even when no
    # misspellings are found
    results = pd.DataFrame(rows, columns=columns)
    return results

# read health facility table
if health_facility_table.endswith(".csv"):
  hf_df=pd.read_csv(io.BytesIO(uploaded[health_facility_table]))
elif health_facility_table.endswith(".xlsx"):
  hf_df=pd.read_excel(io.BytesIO(uploaded[health_facility_table]))
else:
  raise ValueError("health_facility_table must be a .csv or .xlsx file")

# read type dictionary table
if type_dictionary_table.endswith(".csv"):
  sp_df=pd.read_csv(io.BytesIO(uploaded[type_dictionary_table]))
elif type_dictionary_table.endswith(".xlsx"):
  sp_df=pd.read_excel(io.BytesIO(uploaded[type_dictionary_table]))
else:
  raise ValueError("type_dictionary_table must be a .csv or .xlsx file")

spelling_df=generate_misspellings(hf_df,facility_name,sp_df, country_name)
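# export the misspelled type table
# os.path.splitext drops only the final extension, so file names that contain dots are kept whole
output_table_name=os.path.splitext(health_facility_table)[0]+"_spelling_dictionary.csv"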
spelling_df.to_csv(output_table_name)
files.download(output_table_name)
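

The candidate filter and score threshold used inside `generate_misspellings` can be illustrated in isolation. This is a sketch under one stated assumption: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so exact scores may differ from those the notebook computes. The helper names `is_candidate` and `passes_threshold` are hypothetical.

```python
from difflib import SequenceMatcher

def is_candidate(name, keyword):
    # Same first letter, and longer than half the keyword's length.
    return name.startswith(keyword[0]) and len(name) > len(keyword) // 2

def passes_threshold(keyword, spelling):
    # For an n-letter keyword the threshold is (n - 1) / n,
    # e.g. 7 / 8 = 0.875 for "hospital".
    threshold = (len(keyword) - 1) / len(keyword)
    # Stand-in similarity in [0, 1]; the notebook uses fuzz.ratio / 100.
    similarity = SequenceMatcher(None, spelling, keyword).ratio()
    return similarity >= threshold

for spelling in ["hodpital", "hotel"]:
    print(spelling, is_candidate(spelling, "hospital"),
          passes_threshold("hospital", spelling))
```

Because the threshold grows with keyword length, longer keywords tolerate proportionally fewer differing characters, which is one reason to review borderline rows by hand.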




# Part 5: Repeat for all datasets and combine into one file

Repeat all steps for each health facility dataset you have explored in previous landscaping exercises.

Combine all the misspelled type tables that you created for each health facility dataset into one Excel file. Upload that file to your country's Dictionaries folder in Google Drive.
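
The combining step can also be done in pandas rather than by hand. The following is a minimal sketch, not part of the original notebook: `combine_tables` and the `source` column are hypothetical additions, and the example rows and dataset names are illustrative.

```python
import pandas as pd

def combine_tables(tables):
    """Stack several misspelled type tables, recording where each row came from.

    tables: dict mapping a dataset name to its misspelled type table.
    """
    labelled = [df.assign(source=name) for name, df in tables.items()]
    return pd.concat(labelled, ignore_index=True)

# Illustrative tables; in practice, read each *_spelling_dictionary.csv
# with pd.read_csv.
table_a = pd.DataFrame({"word": ["hospital"], "misspelling": ["hodpital"]})
table_b = pd.DataFrame({"word": ["hospital", "clinic"],
                        "misspelling": ["hopitap", "clinik"]})

combined = combine_tables({"dataset_a": table_a, "dataset_b": table_b})
print(combined)

# To save as one Excel file (needs an Excel writer engine such as openpyxl;
# the file name here is only an example):
# combined.to_excel("UGA_spelling_dictionary.xlsx", index=False)
```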