# References



- [Fuzzy string matching in Python](https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/)

- [how to do a vlookup in Python](https://michaeljsanders.com/2017/04/17/python-vlookup.html)
    - `.map()` with a dictionary
    - `.merge()` with a left join
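
The two lookup styles listed above can be sketched on toy data as follows. The frames and column names here (`client`, `nature`) are invented for illustration and are not the files used later in this notebook.

```python
import pandas as pd

# Toy data; all names and values are hypothetical.
cases_demo = pd.DataFrame({'client': ['A', 'B', 'A', 'C']})
clients_demo = pd.DataFrame({'client': ['A', 'B'],
                             'nature': ['firm', 'company']})

# 1) `.map()` with a dictionary: looks up one column, missing keys become NaN
lookup = dict(zip(clients_demo['client'], clients_demo['nature']))
mapped = cases_demo['client'].map(lookup)

# 2) `.merge()` with a left join: keeps every row of `cases_demo`
merged = cases_demo.merge(clients_demo, how='left', on='client')

print(mapped.tolist())
print(merged['nature'].tolist())
```

Both approaches leave client `C`, which is not in the lookup table, with a missing value instead of dropping the row.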

In [1]:
import fuzzywuzzy
from fuzzywuzzy import process

In [2]:
import numpy as np
import pandas as pd
import os
pd.set_option('display.max_rows', 4)

# Introduction


This script is the second step in preparing the dataset for monthly reporting: cleaning the data and filling in the missing information.

In this step, I need to complete the following three tasks:

- New cases come from new clients or old clients. I need to identify the new clients and add them to my `clients.xlsx` file.

- In addition, our headquarters records the same client under different names for various reasons: sometimes because of typos or abbreviations, and sometimes because a client firm has several offices that use different names. For our office's marketing report, I need one single name per client, no matter how many offices it has.

- As the last step, I want to attach the client nature information to the new cases, so that I can easily create pivot tables in Excel with complete information on file.

# Import Data

In [3]:
output_dir = os.path.dirname(os.getcwd()) # save output in folder one level up

In [4]:
# clients on file
clients = pd.read_excel(output_dir + '/' + 'clients.xlsx')
clients.head(1)

Unnamed: 0,委托人,委托人性质,委托人其他使用名
0,3COR,合作所,


In [5]:
# new cases
fn = 'newCases20220723211639.xlsx'
cases = pd.read_excel(fn)
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  75 non-null     int64  
 1   委托人         75 non-null     object 
 2   委托人国别       75 non-null     object 
 3   本所案号        75 non-null     object 
 4   案件类型        75 non-null     object 
 5   申请人         66 non-null     object 
 6   申请人国别       66 non-null     object 
 7   商标名称        73 non-null     object 
 8   类别          73 non-null     float64
 9   商标号         47 non-null     object 
 10  立案日         75 non-null     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 6.6+ KB


# Fuzzy Search Client Names

### Test fuzzy match result for one case only

I use the `process.extractOne()` function from `fuzzywuzzy` to do a fuzzy lookup, comparing the client name ('委托人') in `cases` against the names in `clients`. I want the matching result to come from `clients` so that all new cases use the same name for the same client.

If no match is found, or the match has a low score, it is highly probable that the new case comes from a new client. I'll check these results in Excel and add them to the `clients` file after verification.

In [6]:
# client name in `cases` file
cases.iloc[0]['委托人']

'BERESKIN & PARR LLP'

In [7]:
# fuzzy matching result found in `clients` file
testResult = process.extractOne(cases.iloc[0]['委托人'], clients['委托人'])
testResult

('BERESKIN & PARR LLP', 100, 13)

The matching score is `100` if they match completely. The closer the match, the higher the score.
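
To get a feel for the 0 to 100 scale, here is a minimal sketch that uses only the standard library. The `simple_score` helper is a hypothetical stand-in built on `difflib.SequenceMatcher`. It is not the scorer `process.extractOne` uses by default, so its numbers need not agree with the results in this notebook.

```python
from difflib import SequenceMatcher

def simple_score(a, b):
    """Return a 0-100 similarity score (illustrative stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a.upper(), b.upper()).ratio())

identical = simple_score('BERESKIN & PARR LLP', 'BERESKIN & PARR LLP')
close = simple_score('OMIELIFE INC.', 'OMIELIFE, INC.')
different = simple_score("CHICO'S FAS, INC.", 'BERESKIN & PARR LLP')

print(identical, close, different)
```

Identical strings score 100, a name that differs only by a comma scores slightly lower, and unrelated names score far lower.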

### Conduct fuzzy matching for all cases


In [8]:
# return the match found in the client name ('委托人') column if its score is above 90;
# otherwise, return the best match among the other names used by the client ('委托人其他使用名')
def best_match(name):
    primary = process.extractOne(name, clients['委托人'])
    if primary[1] > 90:
        return primary
    return process.extractOne(name, clients['委托人其他使用名'])

bestMatches = cases['委托人'].apply(best_match)
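
The fallback logic of the cell above can be illustrated without any third-party package. In this sketch, `score`, `extract_one`, and the toy client names are all hypothetical. `extract_one` only mimics the `(match, score, key)` tuple shape seen earlier, and uses a `difflib` scorer instead of the one in `fuzzywuzzy`.

```python
from difflib import SequenceMatcher

def score(a, b):
    # 0-100 similarity; a stand-in scorer, not fuzzywuzzy's default
    return round(100 * SequenceMatcher(None, a, b).ratio())

def extract_one(query, choices):
    # Return (match, score, key) for the best-scoring entry of a dict
    key = max(choices, key=lambda k: score(query, choices[k]))
    return choices[key], score(query, choices[key]), key

def best_match(query, primary, fallback, threshold=90):
    # Prefer the primary names; fall back to other names on a weak score
    first = extract_one(query, primary)
    return first if first[1] > threshold else extract_one(query, fallback)

# Hypothetical client table: index -> name, index -> other name in use
names = {0: 'ACME LLP', 1: 'GLOBEX, INC.'}
other_names = {0: 'ACME LAW OFFICE', 1: 'GLOBEX CORP'}

hit = best_match('ACME LLP', names, other_names)
alias_hit = best_match('GLOBEX CORP', names, other_names)
print(hit)
print(alias_hit)
```

In the second call the returned text is the other name, but the key still points at the client's row, which is why the key is kept alongside the match.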

In [9]:
scores = [x[1] for x in bestMatches]
results = [x[0] for x in bestMatches]
indices = [x[2] for x in bestMatches]

In [10]:
result_df = pd.DataFrame({'Original': cases['委托人'],
                   'Score': scores,
                   'Result': results, 
                  "ClientIndex": indices})

In [11]:
result_df

Unnamed: 0,Original,Score,Result,ClientIndex
0,BERESKIN & PARR LLP,100,BERESKIN & PARR LLP,13
1,BERESKIN & PARR LLP,100,BERESKIN & PARR LLP,13
...,...,...,...,...
73,"CHICO'S FAS, INC.",40,"PATTERSON THUENTE CHRISTENSEN PEDERSEN, P.A. (...",145
74,"CHICO'S FAS, INC.",40,"PATTERSON THUENTE CHRISTENSEN PEDERSEN, P.A. (...",145


### Check the results scored above 90

In [12]:
result_df.loc[result_df['Score']>90, ['Original', 'Score', 'Result']].sort_values('Score')

Unnamed: 0,Original,Score,Result
64,OMIELIFE INC.,96,"OMIELIFE, INC."
37,OMIELIFE INC.,96,"OMIELIFE, INC."
...,...,...,...
30,"YOUNIQUE, LLC",100,"YOUNIQUE, LLC"
69,MERCHANT & GOULD P.C.,100,MERCHANT & GOULD P.C.


In [14]:
# export result >= 95 to excel for review later
notLowerThan_95 = result_df[result_df.Score >= 95].drop_duplicates(subset=['Original', 'Result'])
notLowerThan_95.to_excel('notLowerThan_95.xlsx')

### Resolve the results scored lower than 95

In [15]:
# how many matches score under 95
len(result_df[result_df['Score']<95])

13

In [16]:
# export result < 95 to excel for review (distinct client names from new `cases` for faster check)
lowerThan_95 = result_df[result_df.Score < 95].drop_duplicates(subset=['Original'])
lowerThan_95.to_excel('lowerThan_95.xlsx')

##### Instructions

1. Check the `lowerThan_95.xlsx` file; if these clients are really not on file yet, add them to `clients.xlsx`.
2. Rerun all the code above to import the updated `clients` file and do the matching again.
3. When no matching results score lower than 95, check the `notLowerThan_95.xlsx` file and verify that the results are correct.
4. After all of the above is done, continue with the following tasks to update the client name ('委托人') and client nature ('委托人性质') columns in `cases`.

# Add 委托人性质 Column in Cases

In [18]:
# look up '委托人性质' from `clients`; a left join keeps one row per case, in the original order
clientNature = result_df.merge(clients, how='left', left_on="ClientIndex", right_index=True)['委托人性质'].tolist()

# update '委托人性质' in `cases`
cases['委托人性质'] = clientNature
cases['委托人性质'].head()

0    合作所
1    合作所
    ... 
3    合作所
4    合作所
Name: 委托人性质, Length: 5, dtype: object

In [19]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  75 non-null     int64  
 1   委托人         75 non-null     object 
 2   委托人国别       75 non-null     object 
 3   本所案号        75 non-null     object 
 4   案件类型        75 non-null     object 
 5   申请人         66 non-null     object 
 6   申请人国别       66 non-null     object 
 7   商标名称        73 non-null     object 
 8   类别          73 non-null     float64
 9   商标号         47 non-null     object 
 10  立案日         75 non-null     object 
 11  委托人性质       75 non-null     object 
dtypes: float64(1), int64(1), object(10)
memory usage: 7.2+ KB


### Update 委托人 in cases

Look up the '委托人' column in `clients` by `ClientIndex` and fill in the information in `cases`. I don't update with `result_df['Result']` because some matching results come from the column of other names used by the client ('委托人其他使用名'). I don't want to use those names, so that all client names in `cases` stay aligned.

In [23]:
(cases['委托人'] == result_df['Original']).sum()

75

In [24]:
# `ClientIndex` holds index labels of `clients`, so look them up by label
cases['委托人'] = clients.loc[result_df['ClientIndex'], '委托人'].tolist()
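
A small sketch of why the lookup goes through the index and not through the matched text. The two frames and their column names (`name`, `other_name`) are hypothetical. The lookup here is by index label with `.loc`, which gives the same rows as a positional lookup when the table has the default `RangeIndex`.

```python
import pandas as pd

# Hypothetical client table: a canonical name and another name in use
clients_demo = pd.DataFrame({'name': ['ACME LLP', 'GLOBEX, INC.'],
                             'other_name': ['ACME LAW OFFICE', 'GLOBEX CORP']})

# Hypothetical matching output: the second row matched via the other-name column
result_demo = pd.DataFrame({'Result': ['ACME LLP', 'GLOBEX CORP'],
                            'ClientIndex': [0, 1]})

# Looking up by index always returns the canonical name
canonical = clients_demo.loc[result_demo['ClientIndex'], 'name'].tolist()
print(canonical)
```

The `Result` column would have written `GLOBEX CORP` into the cases, while the index lookup yields the canonical `GLOBEX, INC.`.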

### Save updated case file to excel

In [28]:
# export the updated cases to final case file.
cases.to_excel(fn+'updated.xlsx', header=True, index=False)

### Use `openpyxl` to append output rows to existing final case file

In [29]:
import openpyxl
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

In [31]:
# load the existing final case workbook
wb = load_workbook(output_dir+'/'+'final.xlsx')

# select the active worksheet
ws = wb.active

# append output dataframe rows to the active worksheet data
for r in dataframe_to_rows(cases, index= False, header=False):
    ws.append(r)
    
# save the new workbook data to `final{timestamp}.xlsx` first
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
wb.save(filename='final{}.xlsx'.format(timestamp))

### Final step
- Check the new `final{timestamp}.xlsx` file; if everything is OK, rename it to `final.xlsx` to replace the existing one.

- Move the case files from the `data/` folder to the `in report` folder.
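
If the rename in the first bullet is ever scripted, it could look roughly like the sketch below. It runs in a temporary directory with dummy files. The timestamped file name and the `final_backup.xlsx` backup are assumptions for illustration, and keeping a backup is an extra precaution that the manual steps above do not include.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files
workdir = tempfile.mkdtemp()
current = os.path.join(workdir, 'final.xlsx')
candidate = os.path.join(workdir, 'final20220101000000.xlsx')  # hypothetical timestamp
backup = os.path.join(workdir, 'final_backup.xlsx')            # hypothetical backup name

# Create dummy stand-ins for the existing and the newly saved workbook
for path, content in [(current, b'old'), (candidate, b'new')]:
    with open(path, 'wb') as f:
        f.write(content)

# Keep the old file as a backup, then promote the reviewed file
os.replace(current, backup)
os.replace(candidate, current)

print(sorted(os.listdir(workdir)))
```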