# 3. Text Data Cleaning
-------------------
Group 3, September 28, 2022
1. Gezhi Cheng
2. Haowei Lee
3. Ziyi Liu
4. VS Chaitanya Madduri

> <i>Description: This notebook cleans and stems the 10-K text data and computes keyword frequencies.</i>


<div class="alert alert-block alert-info">
    <b>FYI:</b> Please run this notebook in Google Colab.
</div> 

### Prerequisites:
1. Add a shortcut to the shared drive folder https://drive.google.com/drive/folders/1X4UdGsQiHVWSr63FRiz8rwOuWW5Ua8uI?usp=sharing to your personal Google Drive.

- Because we used the Colab compute engine and the data files are very large, we stored them in our personal Google Drive folders.


Files:
- Selected_10k_v2.csv: stores the 10-K filings of the selected companies.

### Output files:

Files:
- df_final.csv: stores the stemmed results and word frequencies.



## 1. Import Required Packages 

In [None]:
# Connecting to the google drive
from google.colab import drive
drive.mount('/content/drive')
from IPython.display import clear_output

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import nltk
import re 

from nltk.corpus import stopwords                         # Removing all the stopwords
from nltk.stem.porter import PorterStemmer                # Reducing words to base form

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## 2. Loading the files to the notebook

### 2.1 Load the dataframe with 10-k text

In [None]:
DIR_PATH = "/content/drive/MyDrive/SPM_files/"

# sample data
df = pd.read_csv(DIR_PATH + "Selected_10k_v2.csv")


In [None]:
df.head()

Unnamed: 0,Company_Key,Text_data,Quarter_details,Year
0,1069533,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR3,2013
1,1069533,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR1,2014
2,1069533,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR2,2014
3,1069533,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR3,2014
4,1069533,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR1,2015


### 2.2 Extracting the 10-K filings

The dataframe sometimes contains a mix of 10-K and 10-Q filings, so we make sure to process only the 10-K filings.

In [None]:
df['filing_type'] = np.where(df['Text_data'].str.contains("10-K_edgar_data"), "10-K", "10-Q")

In [None]:
df['filing_type'].value_counts()

10-Q    3592
10-K    1184
Name: filing_type, dtype: int64

In [None]:
df = df[df['filing_type'] == "10-K"]

In [None]:
# Resetting the index
df.reset_index(drop=True, inplace=True)
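
The substring test above relies on the `<FileName>` tag in each record's header. As a hedged alternative, the sketch below parses that tag with a regular expression and returns the form type explicitly. The pattern is inferred from the single file name visible in this notebook's sample record, and `parse_filename` is a hypothetical helper, not part of the original pipeline.

```python
import re

# Pattern inferred from one sample file name in this notebook, e.g.
# 20111121_10-K_edgar_data_1126956_0001126956-11-000074.txt
# (assumption: the other records follow the same layout).
FILENAME_RE = re.compile(
    r"<FileName>(?P<date>\d{8})_(?P<form>.+?)_edgar_data_"
    r"(?P<cik>\d+)_(?P<accession>[\d-]+)\.txt</FileName>"
)

def parse_filename(text):
    """Return date, form type, CIK and accession number, or None if absent."""
    match = FILENAME_RE.search(text)
    return match.groupdict() if match else None

sample = (
    "<Header>\r\n<FileStats>\r\n    <FileName>20111121_10-K_edgar_data_"
    "1126956_0001126956-11-000074.txt</FileName>\r\n"
)
info = parse_filename(sample)
print(info["form"], info["cik"])
```

Records for which the helper returns `None` could be inspected by hand instead of being silently labelled as 10-Q.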

## 3. Stemming the text data

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form.
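
As a small illustration, the sketch below applies NLTK's `PorterStemmer` (the stemmer imported in section 1) to a few words. The word list is chosen for illustration only; stems such as `secur` and `exchang` also appear in the cleaned output later in this notebook.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Illustrative words; a stem is not always a dictionary word.
words = ["securities", "exchange", "commission", "running"]
stems = {word: ps.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```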

In [None]:
# Printing the first text record
df['Text_data'][0]

'b"<Header>\\r\\n<FileStats>\\r\\n    <FileName>20111121_10-K_edgar_data_1126956_0001126956-11-000074.txt</FileName>\\r\\n    <GrossFileSize>23606915</GrossFileSize>\\r\\n    <NetFileSize>321950</NetFileSize>\\r\\n    <NonText_DocumentType_Chars>5568598</NonText_DocumentType_Chars>\\r\\n    <HTML_Chars>8845803</HTML_Chars>\\r\\n    <XBRL_Chars>6064230</XBRL_Chars>\\r\\n    <XML_Chars>2002188</XML_Chars>\\r\\n    <N_Exhibits>11</N_Exhibits>\\r\\n</FileStats>\\r\\n<SEC-Header>\\r\\n0001126956-11-000074.hdr.sgml : 20111121\\r\\n<ACCEPTANCE-DATETIME>20111118175254\\r\\nACCESSION NUMBER:\\t\\t0001126956-11-000074\\r\\nCONFORMED SUBMISSION TYPE:\\t10-K\\r\\nPUBLIC DOCUMENT COUNT:\\t\\t13\\r\\nCONFORMED PERIOD OF REPORT:\\t20110930\\r\\nFILED AS OF DATE:\\t\\t20111121\\r\\nDATE AS OF CHANGE:\\t\\t20111118\\r\\n\\r\\nFILER:\\r\\n\\r\\n\\tCOMPANY DATA:\\t\\r\\n\\t\\tCOMPANY CONFORMED NAME:\\t\\t\\tLACLEDE GROUP INC\\r\\n\\t\\tCENTRAL INDEX KEY:\\t\\t\\t0001126956\\r\\n\\t\\tSTANDARD INDUSTRIAL 

In [None]:
df.head()

Unnamed: 0,Company_Key,Text_data,Quarter_details,Year,filing_type
0,1126956,"b""<Header>\r\n<FileStats>\r\n <FileName>201...",QTR4,2011,10-K
1,1126956,"b""<Header>\r\n<FileStats>\r\n <FileName>201...",QTR4,2011,10-K
2,1126956,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR4,2012,10-K
3,1126956,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR4,2012,10-K
4,1126956,b'<Header>\r\n<FileStats>\r\n <FileName>201...,QTR4,2013,10-K


In [None]:
# Dividing the dataset into chunks, because processing 300 records takes
# around two hours.
# In case we are doing a batch-wise execution:
# df2 = df[601:901].copy()
# df3 = df[901:].copy()
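
The batch boundaries in the comment above are written by hand. A minimal sketch of computing them instead, so that no row is skipped or processed twice, could look like this; `batch_bounds` is a hypothetical helper and the batch size of 300 is taken from the comment above.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) positional bounds that cover n_rows in batches."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 1184 is the number of 10-K records reported by value_counts() earlier.
bounds = list(batch_bounds(1184, 300))
print(bounds)
```

Each pair can then be used as `df.iloc[start:stop].copy()` to obtain one batch.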

In [None]:

def data_clean(input_text):
    '''
    The function does the following:
    1. Keeps only alphanumeric characters and lowercases the text
    2. Removes English stopwords and words of 3 or fewer characters
       (to avoid residual strings)
    3. Applies Porter stemming, which reduces a word to its stem,
       for example: exchange is converted to exchang
    Arguments:
    input_text: the text to be stemmed

    return:
    test_process/input_text: the cleaned text; if an error occurs, the original text is returned
    '''
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))                   # build the stopword set once per call, not once per word
    try:
        test_process = re.sub('[^a-zA-Z0-9]', ' ', input_text)     # replace everything except letters and digits with a space
        test_process = test_process.lower()                        # lower case
        test_process = test_process.split()                        # split the string into words
        test_process = [ps.stem(word) for word in test_process     # reduce the remaining words to their stems
              if word not in stop_words and len(word) > 3]
        test_process = " ".join(test_process)
        return test_process
    except Exception:                                              # e.g. a missing (non-string) value
        return input_text


### 3.1 Removing the header tags

In [None]:
df['Text_data'] = df['Text_data'].str.split("</Header>").str[1]
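
Note that `.str.split("</Header>").str[1]` yields `NaN` for any record that has no closing `</Header>` tag. A plain-Python sketch of a more forgiving variant is shown below; `strip_header` is a hypothetical helper, and keeping such records unchanged is an assumption about the desired behaviour.

```python
def strip_header(text, marker="</Header>"):
    """Return the text after the first marker, or the text unchanged if absent."""
    head, sep, tail = text.partition(marker)
    return tail if sep else text

with_header = "<Header>meta</Header>\r\nbody text"
without_header = "body text only"

print(repr(strip_header(with_header)))
print(repr(strip_header(without_header)))
```

Assuming every value in the column is a string, it could be applied with `df['Text_data'].map(strip_header)`.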

Note: Because the step below is time-consuming, we distributed the records among ourselves, ran the step individually on our own machines, and joined the results at the end.

In [None]:
# Please note that this step takes more than an hour to execute.
# For 300 records it takes around 4 hours. We processed around 900 10-K filings.
df['Text_data_cleaned'] = df['Text_data'].apply(data_clean)

In [None]:
df.head()

Unnamed: 0,Company_Key,Text_data,Quarter_details,Year,filing_type
901,72741,\r\n\r\n 0000072741-17-000007.txt : 20170223\r...,QTR1,2017,10-K
902,72741,\r\n\r\n 0000072741-17-000007.txt : 20170223\r...,QTR1,2017,10-K
903,72741,\r\n\r\n 0000072741-17-000007.txt : 20170223\r...,QTR1,2017,10-K
904,72741,\r\n\r\n 0000072741-18-000028.txt : 20180226\r...,QTR1,2018,10-K
905,72741,\r\n\r\n 0000072741-18-000028.txt : 20180226\r...,QTR1,2018,10-K


In [None]:
# The cleaned records were saved to temporary files, namely:
# first_600.csv
# Selected_10k_cleaned_part2.csv
# Selected_10k_cleaned_part3.csv
# Selected_10k_cleaned_part4.csv

# Saving the cleaned records to a temporary file.
# df.drop(['Text_data'], axis=1).to_csv(DIR_PATH + "Selected_10k_cleaned_part3.csv")
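
The temporary files listed above have to be joined again before the frequency calculations. A hedged sketch with `pd.concat` follows; small in-memory frames stand in for the CSV parts, the column names follow this notebook, and the values are invented for illustration.

```python
import pandas as pd

# Stand-ins for the CSV parts; in the notebook each part would come from
# pd.read_csv(DIR_PATH + "<part file>", index_col=0).
part_a = pd.DataFrame(
    {"Company_Key": [1, 1], "Text_data_cleaned": ["aaaa", "bbbb"]}, index=[0, 1]
)
part_b = pd.DataFrame(
    {"Company_Key": [2], "Text_data_cleaned": ["cccc"]}, index=[2]
)

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame.
combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined)
```

Reading each part with `index_col=0` keeps the saved index from reappearing as an extra `Unnamed: 0` column.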

In [None]:
df['Text_data_cleaned'][901]

'0000072741 000007 20170223 a201610kdocu ndocument nunit state secur exchang commiss washington 20549 form annual report pursuant section secur exchang 1934 nfor fiscal year end decemb 2016 ntransit report pursuant section secur exchang 1934 nfor transit period ncommiss file number registr state incorpor address telephon number employ identif 5324 eversourc energi massachusett voluntari associ cadwel drive springfield massachusett 01104 telephon 5000 2147929 00404 connecticut light power compani connecticut corpor selden street berlin connecticut 06037 1616 telephon 5000 0303850 02301 nstar electr compani massachusett corpor boylston street boston massachusett 02199 telephon 5000 1278810 6392 public servic compani hampshir hampshir corpor energi park north commerci street manchest hampshir 03101 1134 telephon 5000 0181050 7624 western massachusett electr compani massachusett corpor cadwel drive springfield massachusett 01104 telephon 5000 1961130 nsecur regist pursuant section registr 
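
Tokens such as `ndocument` and `nunit` in the output above appear to come from literal `\r\n` sequences. The raw column stores the string form of a bytes object, so a line break is the two characters backslash and `n`; replacing the backslash with a space leaves the `n` attached to the next word. The sketch below shows an extra step that could run before `data_clean`. `drop_literal_escapes` is a hypothetical helper and is not part of the original pipeline.

```python
import re

# Matches a literal backslash followed by r, n or t (two characters),
# as found in the string form of a bytes object.
LITERAL_ESCAPE_RE = re.compile(r"\\[rnt]")

def drop_literal_escapes(text):
    """Replace literal backslash-r, backslash-n and backslash-t with a space."""
    return LITERAL_ESCAPE_RE.sub(" ", text)

# Raw string: the backslashes below are real characters, not line breaks.
raw = r"ANNUAL REPORT\r\nUNITED STATES\tSECURITIES"
cleaned = drop_literal_escapes(raw)
print(cleaned.split())
```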

## 4. Frequency Calculations

In [None]:
# In case the execution was stopped, we continue from the saved files
df = pd.read_csv(DIR_PATH + "Selected_10k_cleaned_part2.csv")

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company_Key,Quarter_details,Year,filing_type,Text_data_cleaned
0,601,1326160,QTR1,2019,10-K,0001326160 000057 20190228 20181231x10k form n...
1,602,1326160,QTR1,2020,10-K,0001326160 000034 20200220 20191231x10k form n...
2,603,1326160,QTR1,2020,10-K,0001326160 000034 20200220 20191231x10k form n...
3,604,1326160,QTR1,2020,10-K,0001326160 000034 20200220 20191231x10k form n...
4,605,1326160,QTR1,2020,10-K,0001326160 000034 20200220 20191231x10k form n...


In [None]:
# Keywords for potential zombie companies (from the case study)
operation_related = ["restructure", "reorganization", "reorganize", "reorganized", "reorganizing", "turnaround", "restructuring", "cost", "asset"]
credit_related = ["loan", "concession", "swap", "forgiveness", "moratorium", "bond", "covenant"]
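
The lists above hold unstemmed words, whereas `Text_data_cleaned` holds Porter stems (for example `secur` and `exchang`), so a keyword whose stem differs from its surface form would never match. A hedged sketch that stems the keyword lists with the same stemmer before matching is given below; whether the original analysis intended this is an assumption, and the names `operation_stems` and `credit_stems` are hypothetical.

```python
from nltk.stem.porter import PorterStemmer

operation_related = ["restructure", "reorganization", "reorganize", "reorganized",
                     "reorganizing", "turnaround", "restructuring", "cost", "asset"]
credit_related = ["loan", "concession", "swap", "forgiveness", "moratorium",
                  "bond", "covenant"]

ps = PorterStemmer()

# Sets remove duplicates created when several inflections share one stem
# and make membership tests fast.
operation_stems = {ps.stem(word) for word in operation_related}
credit_stems = {ps.stem(word) for word in credit_related}

print(sorted(operation_stems))
print(sorted(credit_stems))
```

Matching against these sets would change the reported frequencies, so the results below would need to be recomputed.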

In [None]:
def get_negative(data):
    '''
    Calculates the share of words in the text that appear in the
    negative-word list.
    Arguments:
    data: stemmed text

    return:
    negative_percentage: negative-word count divided by the total word count
    '''

    DIR_PATH = "/content/drive/MyDrive/Strategy_Final/"
    negative_words = set()
    with open(DIR_PATH + 'negative-words.txt', 'r') as f:
        for line in f:
            negative_words.add(line.strip())      # strip the trailing newline

    negative_count = 0

    words = data.split()

    for word in words:
        if word in negative_words:
            negative_count += 1

    negative_percentage = negative_count / len(words)

    return negative_percentage


def get_operation(data):
    '''
    Calculates the share of words in the text that are operation-related
    keywords.
    Arguments:
    data: stemmed text

    return:
    operation_percentage: operation-related word count divided by the total word count
    '''
    operation_count = 0

    words = data.split()

    for word in words:
        if word in operation_related:
            operation_count += 1

    return operation_count / len(words)

def get_credit(data):
    '''
    Calculates the share of words in the text that are credit-related
    keywords.
    Arguments:
    data: stemmed text

    return:
    credit_percentage: credit-related word count divided by the total word count
    '''
    credit_count = 0

    words = data.split()

    for word in words:
        if word in credit_related:
            credit_count += 1

    return credit_count / len(words)
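
The three functions above share the same counting logic. As a sketch, one generic helper could cover all of them; `term_share` is a hypothetical name, and returning 0.0 for an empty text is an assumption rather than behaviour taken from the original code.

```python
def term_share(text, vocabulary):
    """Return the fraction of whitespace-separated words found in vocabulary."""
    words = text.split()
    if not words:                      # avoid ZeroDivisionError on empty text
        return 0.0
    vocabulary = set(vocabulary)       # set lookup is fast for long word lists
    hits = sum(1 for word in words if word in vocabulary)
    return hits / len(words)

sample_text = "cost asset loan cost report"
share = term_share(sample_text, ["cost", "asset"])
print(share)
print(term_share("", ["cost"]))
```

With such a helper, each column could be computed as `df_final["Text_data_cleaned"].apply(term_share, vocabulary=operation_related)`, replacing the word list as needed.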

In [None]:
df_final = df.copy()
# Loading the saved files from storage in case the kernel breaks


In [None]:
df_final.head()

Unnamed: 0,Company_Key,Quarter_details,Year,filing_type,Text_data_cleaned,operation_percentage,negative_percentage,credit_percentage,compamy_name
0,1126956,QTR4,2011,10-K,0001126956 000074 20111121 lacledegroupform10 ...,0.015983,0.018151,0.001653,LACLEDE GROUP INC
1,1126956,QTR4,2011,10-K,0001126956 000075 20111121 lacledegasfor10 k20...,0.01998,0.017551,0.001041,LACLEDE GAS CO
2,1126956,QTR4,2012,10-K,0001126956 000080 20121119 lacledegroupform10 ...,0.007124,0.013886,0.009069,LACLEDE GROUP INC
3,1126956,QTR4,2012,10-K,0001126956 000081 20121119 lacledegasform10 k2...,0.009466,0.012643,0.01383,LACLEDE GAS CO
4,1126956,QTR4,2013,10-K,0001126956 000067 20131126 20130930x10k 2013 n...,0.010843,0.015798,0.008905,LACLEDE GROUP INC


### 4.1 Calculation of the frequency count

In [None]:
# Calculating operating percentage
df_final["operation_percentage"] = df_final["Text_data_cleaned"].apply(get_operation)


In [None]:
# Calculating negative words percentage
df_final["negative_percentage"] = df_final["Text_data_cleaned"].apply(get_negative)

In [None]:
# Calculating credit words percentage
df_final["credit_percentage"] = df_final["Text_data_cleaned"].apply(get_credit)

In [None]:
df_final.head(2)

Unnamed: 0,Company_Key,Quarter_details,Year,filing_type,Text_data_cleaned,operation_percentage,negative_percentage,credit_percentage
0,1126956,QTR4,2011,10-K,0001126956 000074 20111121 lacledegroupform10 ...,0.015983,0.018151,0.001653
1,1126956,QTR4,2011,10-K,0001126956 000075 20111121 lacledegasfor10 k20...,0.01998,0.017551,0.001041


### 4.2 Summary statistics of the frequency counts

In [None]:
df_final.describe()

Unnamed: 0,Company_Key,Year,operation_percentage,negative_percentage,credit_percentage
count,1184.0,1184.0,1184.0,1184.0,1184.0
mean,726327.3,2016.78125,0.013722,0.017724,0.002527
std,542336.5,2.953907,0.003328,0.00381,0.002245
min,4904.0,2011.0,0.003845,0.00538,0.000184
25%,76063.0,2014.0,0.011366,0.015568,0.001363
50%,922224.0,2017.0,0.013663,0.017717,0.001856
75%,1109357.0,2019.0,0.015539,0.019801,0.002769
max,1733998.0,2021.0,0.025175,0.030543,0.019641


In [None]:
df_final.head()

Unnamed: 0,Company_Key,Quarter_details,Year,filing_type,Text_data_cleaned,operation_percentage,negative_percentage,credit_percentage,compamy_name
0,1126956,QTR4,2011,10-K,0001126956 000074 20111121 lacledegroupform10 ...,0.015983,0.018151,0.001653,LACLEDE GROUP INC
1,1126956,QTR4,2011,10-K,0001126956 000075 20111121 lacledegasfor10 k20...,0.01998,0.017551,0.001041,LACLEDE GAS CO
2,1126956,QTR4,2012,10-K,0001126956 000080 20121119 lacledegroupform10 ...,0.007124,0.013886,0.009069,LACLEDE GROUP INC
3,1126956,QTR4,2012,10-K,0001126956 000081 20121119 lacledegasform10 k2...,0.009466,0.012643,0.01383,LACLEDE GAS CO
4,1126956,QTR4,2013,10-K,0001126956 000067 20131126 20130930x10k 2013 n...,0.010843,0.015798,0.008905,LACLEDE GROUP INC


### 4.3 Filtering the records for the companies with the strongest restructuring signals

Exploratory data analysis

In [None]:
df_final[(df_final['operation_percentage']>0.02) & (df_final['negative_percentage']>0.025) & (df_final['operation_percentage']>0.01) ]

Unnamed: 0,Company_Key,Quarter_details,Year,filing_type,Text_data_cleaned,operation_percentage,negative_percentage,credit_percentage
271,92521,QTR1,2014,10-K,0000092521 000003 20140224 sps1231201310 nsp 2...,0.020494,0.025209,0.001278


In [None]:
fg = df_final[(df_final['operation_percentage']>0.0155) & (df_final['negative_percentage']>0.019) & (df_final['operation_percentage']>0.002769) ]
fg

Unnamed: 0,Company_Key,Quarter_details,Year,filing_type,Text_data_cleaned,operation_percentage,negative_percentage,credit_percentage
7,1126956,QTR4,2015,10-K,0001126956 000077 20151124 lglgcagc 20150930x1...,0.015643,0.019801,0.002205
8,1126956,QTR4,2015,10-K,0001126956 000077 20151124 lglgcagc 20150930x1...,0.015643,0.019801,0.002205
9,1126956,QTR4,2015,10-K,0001126956 000077 20151124 lglgcagc 20150930x1...,0.015643,0.019801,0.002205
46,1060391,QTR1,2013,10-K,0001060391 000011 20130219 rsg2012x1231x10xk n...,0.016343,0.020710,0.001245
47,1060391,QTR1,2014,10-K,0001060391 000011 20140213 2013x1231x10xk nrsg...,0.017394,0.020522,0.001819
...,...,...,...,...,...,...,...,...
919,827052,QTR1,2017,10-K,0000827052 000033 20170221 sce201610k ndocumen...,0.018008,0.019756,0.001238
922,827052,QTR1,2019,10-K,0000827052 000034 20190228 sceq4201810k 2018 n...,0.017271,0.022162,0.000982
923,827052,QTR1,2019,10-K,0000827052 000034 20190228 sceq4201810k 2018 n...,0.017271,0.022162,0.000982
924,827052,QTR1,2020,10-K,0000827052 000026 20200227 sceq4201910k form 2...,0.017329,0.022413,0.001644
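
The cut-offs 0.0155 and 0.002769 are close to the 75th-percentile values reported by `describe()` for `operation_percentage` (0.015539) and `credit_percentage` (0.002769). The sketch below derives the thresholds from quantiles instead of hard-coding them. It assumes the third condition was meant for `credit_percentage`, whereas the cell above applies it to `operation_percentage`; the toy data is invented for illustration.

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    "operation_percentage": [0.010, 0.012, 0.016, 0.020],
    "negative_percentage":  [0.015, 0.017, 0.020, 0.026],
    "credit_percentage":    [0.001, 0.002, 0.003, 0.004],
})

cols = ["operation_percentage", "negative_percentage", "credit_percentage"]
thresholds = toy[cols].quantile(0.75)          # one cut-off per column

# Keep rows that exceed the cut-off in every column.
mask = toy[cols].gt(thresholds, axis="columns").all(axis=1)
selected = toy[mask]
print(selected)
```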


In [None]:
fg[(fg['Company_Key'] == 827052) & (fg['Year'] == 2019) ]['Text_data_cleaned'][922]

'0000827052 000034 20190228 sceq4201810k 2018 ndocument nunit state secur exchang commiss washington 20549 form mark annual report pursuant section secur exchang 1934 fiscal year end decemb 2018 transit report pursuant section secur exchang 1934 transit period commiss file number exact name registr specifi charter state jurisdict incorpor organ employ identif number 9936 edison intern california 4137452 2313 southern california edison compani california 1240335 edison intern southern california edison compani 2244 walnut grove avenu rosemead california 91770 address princip execut offic 2244 walnut grove avenu rosemead california 91770 address princip execut offic 2222 registr telephon number includ area code 1212 registr telephon number includ area code secur regist pursuant section titl class name exchang regist edison intern common stock valu nyse southern california edison compani cumul prefer stock nyse american seri seri seri seri nsecur regist pursuant section none indic check m

In [None]:
# exporting the file for future use
# DIR_PATH = "/content/drive/MyDrive/SPM_files/"
# df_final.to_csv(DIR_PATH + "final_df.csv", index=None)

## End of the Notebook
