In this notebook, we work through a text-mining exercise: finding reports by category.

Suppose you have a large number of folders and files (10,000 files or more),
and only a few of them are the drilling/completion reports you need.
The reports may also be saved in different formats, such as *.pdf, *.doc, and *.xlsx.

A clear workflow for finding these files and collecting them into a well-organized folder is therefore very useful.

In this workflow, the decision is based only on file names.

** I use the fuzzywuzzy library to calculate string distance.
Installation and usage are described here: https://www.geeksforgeeks.org/fuzzywuzzy-python-library/
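
To give a feel for what a string-similarity score looks like, here is a minimal sketch. `fuzz.ratio` returns an integer from 0 to 100; the sketch uses `difflib` from the standard library as a stand-in so that it runs without installing anything. `ratio_sketch` is a hypothetical name, and its scores may differ slightly from `fuzz.ratio` depending on the backend fuzzywuzzy is using.

```python
from difflib import SequenceMatcher


def ratio_sketch(a, b):
    """Stand-in similarity score on a 0-100 scale (assumption: not fuzzywuzzy itself)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


same = ratio_sketch("reservoir", "reservoir")      # identical strings
typo = ratio_sketch("resevoir", "reservoir")       # one missing letter
unrelated = ratio_sketch("drilling", "reservoir")  # different words
print(same, typo, unrelated)
```

Identical strings score 100, a small typo stays close to 100, and unrelated words score low, which is why a threshold can separate matches from non-matches.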

In [1]:
import numpy as np
import pandas as pd 
import os
# import textdistance
from fuzzywuzzy import fuzz
import re 



### 1. Scan and get the paths of all files under a given root path

In [2]:
# function comes from: https://thispointer.com/python-how-to-get-list-of-files-in-directory-and-sub-directories/
def getListOfFiles(dirName):
    # create a list of file and sub directories 
    # names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles  
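
As an alternative sketch, the same recursive scan can be written with `os.walk` from the standard library, which avoids explicit recursion. `list_files_walk` is a hypothetical name, and the demo runs on a throwaway temporary folder rather than on the real course directory.

```python
import os
import tempfile


def list_files_walk(root_dir):
    """Return the full path of every file under root_dir, using os.walk."""
    all_files = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            all_files.append(os.path.join(dirpath, name))
    return all_files


# Build a small temporary tree with three files at different depths.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub", "deeper"))
    for rel in ("a.pdf",
                os.path.join("sub", "b.doc"),
                os.path.join("sub", "deeper", "c.xlsx")):
        open(os.path.join(tmp, rel), "w").close()
    found = sorted(os.path.relpath(p, tmp) for p in list_files_walk(tmp))
    print(found)
```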

In [3]:
root = r'C:\MXD\courses'
filePath = getListOfFiles(root)

In [4]:
len(filePath)

7828

In [5]:
filePath[1111]

"C:\\MXD\\courses\\PETE 603\\Wattenbarger exam\\ExamB 20'05.pdf"

In [6]:
# split each full path into folder, full file name, base name and extension
rows = []
for file in filePath:
    # os.path handles the path separator, so no manual search for '\' is needed
    folder, fileFullName = os.path.split(file)
    # splitext splits at the last dot; the extension is '' when there is none
    fileName, extension = os.path.splitext(fileFullName)
    rows.append({'Path': folder,
                 'FileFullName': fileFullName,
                 'FileName': fileName,
                 'Extension': extension.lstrip('.')})  # 'pdf' rather than '.pdf'

df = pd.DataFrame(rows)
df.head()

Unnamed: 0,Path,FileFullName,FileName,Extension
0,C:\MXD\courses,13IPTC_Tian - AYERS COMMENTS - 27 FEB.ppt,13IPTC_Tian - AYERS COMMENTS - 27 FEB,ppt
1,C:\MXD\courses,1_pvt (1).pdf,1_pvt (1),pdf
2,C:\MXD\courses,2011_05_spee_houston_palke_simulation.pdf,2011_05_spee_houston_palke_simulation,pdf
3,C:\MXD\courses,648 project-team 4.pdf,648 project-team 4,pdf
4,C:\MXD\courses,Anadarko_Lunch_Learn_3_22_2006_v2.pdf,Anadarko_Lunch_Learn_3_22_2006_v2,pdf
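
The split into folder, file name and extension can also be sketched with `pathlib`. `PureWindowsPath` parses Windows-style paths on any operating system, which helps when the list of paths was collected on a different machine. `split_windows_path` is a hypothetical helper name; the sample path is the one shown earlier in this notebook.

```python
from pathlib import PureWindowsPath


def split_windows_path(path_str):
    """Split a Windows-style path into the four columns used in the DataFrame."""
    p = PureWindowsPath(path_str)
    return {
        "Path": str(p.parent),
        "FileFullName": p.name,
        "FileName": p.stem,                 # name without the last extension
        "Extension": p.suffix.lstrip("."),  # 'pdf' rather than '.pdf'
    }


parts = split_windows_path(r"C:\MXD\courses\PETE 603\Wattenbarger exam\ExamB 20'05.pdf")
print(parts)
```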


### 2. Find target files by calculating string distance
Since we are looking for specific kinds of files, such as reservoir simulation reports, drilling reports, etc.,

we can pre-define a keyword library for each type, containing the words people most frequently use given their naming habits.

** Here I search for reservoir files as an example.
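
A keyword library covering several report types could be sketched as a dictionary of keyword sets. Only the reservoir keywords come from this notebook; the drilling and completion keywords, the names `KEYWORDS`, `similarity` and `best_category`, and the use of `difflib` as a stand-in scorer are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical keyword library: only the 'reservoir' entry is taken from the notebook.
KEYWORDS = {
    "reservoir": {"reserv", "reservoir", "reservoirs"},
    "drilling": {"drill", "drilling", "ddr"},
    "completion": {"completion", "completions", "frac"},
}


def similarity(a, b):
    """Stand-in 0-100 score based on difflib (assumption: not fuzzywuzzy itself)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_category(clean_name, keywords=KEYWORDS, threshold=80):
    """Return (category, score) for the best word/keyword match.

    The category is None when the best score does not exceed the threshold.
    """
    best_cat, best_score = None, 0
    for category, words in keywords.items():
        for token in clean_name.split():
            for word in words:
                score = similarity(token, word)
                if score > best_score:
                    best_cat, best_score = category, score
    if best_score <= threshold:
        return None, best_score
    return best_cat, best_score


print(best_category("sandstone reservoir"))
print(best_category("daily drilling report"))
print(best_category("resources and help"))
```

Keeping one keyword set per category means a new report type only needs a new dictionary entry, not new code.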

In [13]:
# keywords for reservoir files; can be expanded based on naming habits
resKey = {'reserv', 'reservoir', 'reservoirs'}

In [15]:
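def getScore(stra, resKey):
    # best fuzz.ratio (0-100) between any word in the cleaned name and any keyword;
    # default=0 avoids a ValueError when the name contains no words
    return max((fuzz.ratio(b.lower(), a) for a in resKey for b in stra.split()), default=0)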

In [9]:
def cleanName(stra):
    regex = re.compile('[^a-zA-Z]')
    stra = regex.sub(' ', stra.strip())
    return stra.strip().lower()

In [10]:
# clean file names, keeping only alphabetic characters
df['CleanName'] = df['FileName'].apply(cleanName)
df.head()

Unnamed: 0,Path,FileFullName,FileName,Extension,CleanName
0,C:\MXD\courses,13IPTC_Tian - AYERS COMMENTS - 27 FEB.ppt,13IPTC_Tian - AYERS COMMENTS - 27 FEB,ppt,iptc tian ayers comments feb
1,C:\MXD\courses,1_pvt (1).pdf,1_pvt (1),pdf,pvt
2,C:\MXD\courses,2011_05_spee_houston_palke_simulation.pdf,2011_05_spee_houston_palke_simulation,pdf,spee houston palke simulation
3,C:\MXD\courses,648 project-team 4.pdf,648 project-team 4,pdf,project team
4,C:\MXD\courses,Anadarko_Lunch_Learn_3_22_2006_v2.pdf,Anadarko_Lunch_Learn_3_22_2006_v2,pdf,anadarko lunch learn v
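
Note that `cleanName` replaces every non-letter character with its own space, so a run of digits and punctuation can leave several consecutive spaces inside the cleaned name. That does not affect `str.split()`, but if a compact name is preferred, a variant along these lines collapses each run into a single space. `cleanNameCollapsed` is a hypothetical name, not part of the original workflow.

```python
import re

# The '+' makes each run of non-letter characters collapse into one space.
_NON_LETTERS = re.compile(r'[^a-zA-Z]+')


def cleanNameCollapsed(name):
    """Lower-case the name and keep only letters, separated by single spaces."""
    return _NON_LETTERS.sub(' ', name).strip().lower()


samples = ['13IPTC_Tian - AYERS COMMENTS - 27 FEB',
           '1_pvt (1)',
           'Anadarko_Lunch_Learn_3_22_2006_v2']
cleaned = [cleanNameCollapsed(s) for s in samples]
print(cleaned)
```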


In [11]:
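# first screen: keep only the pdf files whose clean name contains 'res'
mask = (df['CleanName'].str.contains('res')) & (df['Extension'] == 'pdf')
# .copy() so that adding a column later does not trigger SettingWithCopyWarning
df = df[mask].copy()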
df.head()

Unnamed: 0,Path,FileFullName,FileName,Extension,CleanName
208,C:\MXD\courses\Coursera\coursera-Applied-Data-...,2.2_Distance-Measures.pdf,2.2_Distance-Measures,pdf,distance measures
228,C:\MXD\courses\Coursera\coursera-Applied-Data-...,1.3_Regular-Expressions.pdf,1.3_Regular-Expressions,pdf,regular expressions
239,C:\MXD\courses\Coursera\coursera-Applied-Data-...,3.2_Identifying-Features-from-Text.pdf,3.2_Identifying-Features-from-Text,pdf,identifying features from text
334,C:\MXD\courses\Coursera\GIS,Resources-and-Help_2016_10_09.pdf,Resources-and-Help_2016_10_09,pdf,resources and help
340,C:\MXD\courses\GEOL 619,06_07_method_exploration_reservoir_geophysics.pdf,06_07_method_exploration_reservoir_geophysics,pdf,method exploration reservoir geophysics


In [16]:
# score each clean file name against the keyword set
df['resScore'] = df['CleanName'].apply(lambda x: getScore(x, resKey))
df.head()

Unnamed: 0,Path,FileFullName,FileName,Extension,CleanName,resScore
208,C:\MXD\courses\Coursera\coursera-Applied-Data-...,2.2_Distance-Measures.pdf,2.2_Distance-Measures,pdf,distance measures,43
228,C:\MXD\courses\Coursera\coursera-Applied-Data-...,1.3_Regular-Expressions.pdf,1.3_Regular-Expressions,pdf,regular expressions,46
239,C:\MXD\courses\Coursera\coursera-Applied-Data-...,3.2_Identifying-Features-from-Text.pdf,3.2_Identifying-Features-from-Text,pdf,identifying features from text,43
334,C:\MXD\courses\Coursera\GIS,Resources-and-Help_2016_10_09.pdf,Resources-and-Help_2016_10_09,pdf,resources and help,63
340,C:\MXD\courses\GEOL 619,06_07_method_exploration_reservoir_geophysics.pdf,06_07_method_exploration_reservoir_geophysics,pdf,method exploration reservoir geophysics,100


In [19]:
# filter out low-score files (keep scores above 80)
df[df['resScore']>80]

Unnamed: 0,Path,FileFullName,FileName,Extension,CleanName,resScore
340,C:\MXD\courses\GEOL 619,06_07_method_exploration_reservoir_geophysics.pdf,06_07_method_exploration_reservoir_geophysics,pdf,method exploration reservoir geophysics,100
352,C:\MXD\courses\GEOL 619,15_sandstone_reservoir.pdf,15_sandstone_reservoir,pdf,sandstone reservoir,100
353,C:\MXD\courses\GEOL 619,16_carbonate_reservoir.pdf,16_carbonate_reservoir,pdf,carbonate reservoir,100
354,C:\MXD\courses\GEOL 619,17_reservoir_frontier_research.pdf,17_reservoir_frontier_research,pdf,reservoir frontier research,100
370,C:\MXD\courses\GEOL 619\New Folder,Depositional system and reservoir potential of...,Depositional system and reservoir potential of...,pdf,depositional system and reservoir potential of...,100
2552,C:\MXD\courses\PETE 637\NOTES,L09B_Fractured Reservoir Simulation.pdf,L09B_Fractured Reservoir Simulation,pdf,l b fractured reservoir simulation,100
2577,C:\MXD\courses\PETE 640,1555630898_Basic_Applied_Reservoir_Simulation.pdf,1555630898_Basic_Applied_Reservoir_Simulation,pdf,basic applied reservoir simulation,100
2687,C:\MXD\courses\PETE 640\PetE_640_Handouts,SOMERTON--High_Temperature_Behavior_of_Rocks_A...,SOMERTON--High_Temperature_Behavior_of_Rocks_A...,pdf,somerton high temperature behavior of rocks a...,100
2708,C:\MXD\courses\PETE 640,"SUN,MOHANTY--Simulation_of_Methane_Hydrate_Res...","SUN,MOHANTY--Simulation_of_Methane_Hydrate_Res...",pdf,sun mohanty simulation of methane hydrate res...,100
2727,C:\MXD\courses\PETE 641\641_Material,KOCABAS--Thermal_Transiens_During_Nonisotherma...,KOCABAS--Thermal_Transiens_During_Nonisotherma...,pdf,kocabas thermal transiens during nonisotherma...,100


** Based on the file paths, you can then copy the matching files into a target folder.
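
The copy step could be sketched as follows, using `shutil.copy2` from the standard library. `copy_matches` is a hypothetical helper, and the renaming rule for files that share a name (appending `_1`, `_2`, ...) is an assumption; the demo uses a temporary folder rather than real report files.

```python
import os
import shutil
import tempfile


def copy_matches(paths, target_dir):
    """Copy each file in paths into target_dir, renaming on name collisions."""
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for src in paths:
        base, ext = os.path.splitext(os.path.basename(src))
        dest = os.path.join(target_dir, base + ext)
        counter = 1
        # Files from different folders may share a name; add a numeric suffix.
        while os.path.exists(dest):
            dest = os.path.join(target_dir, f"{base}_{counter}{ext}")
            counter += 1
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(dest)
    return copied


# Demo: two files with the same name in different folders.
with tempfile.TemporaryDirectory() as tmp:
    src_a = os.path.join(tmp, "a", "report.pdf")
    src_b = os.path.join(tmp, "b", "report.pdf")
    for p in (src_a, src_b):
        os.makedirs(os.path.dirname(p))
        with open(p, "w") as fh:
            fh.write("demo")
    copied = copy_matches([src_a, src_b], os.path.join(tmp, "target"))
    copied_names = sorted(os.path.basename(p) for p in copied)
    print(copied_names)
```

In the notebook, the list of source paths would come from joining the `Path` and `FileFullName` columns of the filtered DataFrame.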

This is a very simple example of text mining.

There are always many ways to do it, such as regular expressions, string distance, and more.

Regular expressions are especially useful for files named with a well API number or following fixed rules for well names.
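
As a minimal sketch of the regular-expression approach, the pattern below looks for an API-like well number in a file name. It assumes the hyphenated layout of a 2-digit state code, a 3-digit county code and a 5-digit well identifier, optionally followed by up to two 2-digit suffixes; real file names may use other layouts. `findApiNumber` and the sample file names are made up for illustration.

```python
import re

# Assumed layout: SS-CCC-WWWWW with up to two optional 2-digit suffixes.
# Lookarounds are used instead of \b because '_' counts as a word character.
API_PATTERN = re.compile(r'(?<!\d)\d{2}-\d{3}-\d{5}(?:-\d{2}){0,2}(?!\d)')


def findApiNumber(file_name):
    """Return the first API-like number in file_name, or None if there is none."""
    match = API_PATTERN.search(file_name)
    return match.group(0) if match else None


# Made-up file names, for illustration only.
examples = ['DDR_42-501-20130_2019.pdf',
            '42-501-20130-01-00 completion.pdf',
            'report_final.pdf']
found = [findApiNumber(name) for name in examples]
print(found)
```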