# Initial Step:
## Cleaning and Organizing Lexis Nexis Data
* Using the Lexis Nexis data I collected and compiled in LN_data, this notebook cleans the ~10,000 individual articles so that the dataset is easier to navigate and analyze in separate notebook(s).


---

### Setup
* Any additional modules:

In [1]:
from collections import Counter
import string
import os
import json
import random
import re
import glob

### Functions

* Helper functions that may help with the cleaning and organizing process.

* They are stored in a separate notebook, `functions.ipynb`, to allow for quick access.

In [2]:
%run functions.ipynb

## Data Cleaning

### 1. Load and read all of the data from 2017-2021
Goals:
   * return one list of dictionaries per year, where each dictionary contains an individual article
       * previously, articles were grouped roughly 100 to a txt file, since I had to manually download them from Lexis Nexis in batches of 100.
       * ex: all 2017 articles will be added to a single list of dictionaries dedicated to 2017 Lexis Nexis data.
   * split each article on the 'Body' of the document and remove any non-articles
       * include information from the header, as it may be useful during the analysis.
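
The split described in the goals can be illustrated on a single made-up article. This is a minimal sketch: the sample text and its layout (title, source, and date on the first three lines, then `Body`, then `Load-Date:`) are assumptions for illustration, not a real Lexis Nexis export.

```python
# Made-up article text; the layout is an assumption for illustration only.
sample = """Example headline
Example Newspaper
January 5, 2017
Body
First paragraph of the article.
Second paragraph of the article.
Load-Date: January 6, 2017
"""

# The body starts right after the word 'Body' and ends before 'Load-Date:'.
body_start = sample.index('Body') + len('Body')
body_end = sample.find('Load-Date:')
body = sample[body_start:body_end].strip()

# Everything before the body is the header; its first lines hold the metadata.
header_lines = sample[:body_start].strip().split('\n')
article = {
    'title': header_lines[0],
    'source': header_lines[1],
    'date': header_lines[2],
    'body': body,
}
print(article['title'], '|', article['date'])
```

Keeping the header fields next to the body means each article carries its own metadata into the analysis notebooks.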

In [3]:
# read one raw file; the with-block closes the file afterwards
with open('../data/LN_data/raw/LN_2017/LNpilot_2017.txt') as f:
    raw_text = f.read()

In [4]:
# os.listdir returns entries in arbitrary order, so sort the year folders
years = sorted(f for f in os.listdir("../data/LN_data/raw") if f.startswith("LN"))

In [5]:
years

['LN_2017', 'LN_2018', 'LN_2019', 'LN_2020', 'LN_2021']

In [6]:
raw_texts = glob.glob("../data/LN_data/raw/LN_*/LN*.txt")
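
To check how many raw files each year folder contributes, the paths returned by `glob.glob` can be grouped by their parent folder. A small sketch, using hypothetical example paths in place of the real `raw_texts` list:

```python
from collections import Counter
import os

# Hypothetical paths standing in for the list returned by glob.glob.
example_paths = [
    '../data/LN_data/raw/LN_2017/LN47_2017_1.txt',
    '../data/LN_data/raw/LN_2017/LNpilot_2017.txt',
    '../data/LN_data/raw/LN_2021/LN_2021_1.txt',
]

# The parent folder name (e.g. 'LN_2017') identifies the year.
files_per_year = Counter(
    os.path.basename(os.path.dirname(p)) for p in example_paths
)
print(files_per_year)
```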

### Additional functions

In [7]:
def process_LN_year(fpath):
    '''process all the text files in a folder, e.g. LN_2017, and return a list of dictionaries
    
    args:
         fpath - file path to the folder containing the text files
    returns:
         a list of dictionaries, where each dictionary is a single article
    '''
    
    docs = []
    # sort the paths so the files are processed in a reproducible order
    ln_files = sorted(glob.glob(f'{fpath}/LN*.txt'))
    print('Processing', len(ln_files), 'files')
    
    for fname in ln_files:
        print('processing', fname)
        docs.extend(process_LN_file(fname))
        
    return docs

In [8]:
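def process_LN_file(fpath):
    '''process a single Lexis Nexis text file and return a list of dictionaries

    args:
         fpath - file path to the text file
    returns:
         a list of dictionaries, where each dictionary is a single article
    '''

    docs = []

    with open(fpath) as f:
        raw_text = f.read()

    # split the file on the 'End of Document' marker and skip the first chunk
    all_docs = raw_text.split('End of Document')[1:]

    # chunks that could not be parsed (kept for inspection, not returned)
    rejected_list = []

    for doc in all_docs:
        try:
            # raises ValueError if the chunk has no 'Body' marker
            body_text_start = doc.index('Body') + 4
            body_text_end = doc.find('Load-Date:')
            if body_text_end == -1:
                # no 'Load-Date:' marker: keep the text through the end of the chunk
                body_text_end = len(doc)

            body_text = doc[body_text_start:body_text_end].strip()

            header = doc[:body_text_start].strip()
            hlines = header.split('\n')

            # raises IndexError if the header has fewer than three lines
            article = {
                'title': hlines[0],
                'source': hlines[1],
                'date': hlines[2],
                'body': body_text
            }

            docs.append(article)
        except (ValueError, IndexError):
            rejected_list.append(doc)

    return docs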
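
`process_LN_file` collects rejected chunks in `rejected_list` but does not return them. One possible way to see how many chunks are dropped is a variant that returns both lists. The sketch below is hypothetical: `split_LN_text` and the sample text are not part of the notebook, and it works on a string rather than a file so that it can be run on its own.

```python
def split_LN_text(raw_text):
    """Hypothetical variant of process_LN_file that also returns rejected chunks."""
    docs, rejected = [], []
    # As in process_LN_file, the chunk before the first marker is skipped.
    for chunk in raw_text.split('End of Document')[1:]:
        try:
            body_start = chunk.index('Body') + len('Body')
            body_end = chunk.find('Load-Date:')
            hlines = chunk[:body_start].strip().split('\n')
            docs.append({
                'title': hlines[0],
                'source': hlines[1],
                'date': hlines[2],
                'body': chunk[body_start:body_end].strip(),
            })
        except (ValueError, IndexError):
            # No 'Body' marker, or fewer than three header lines.
            rejected.append(chunk)
    return docs, rejected


# Made-up text: a leading chunk, one well-formed article, and a trailing
# chunk without a body.
sample = (
    "Cover page\nEnd of Document\n"
    "Headline\nNewspaper\nMarch 1, 2019\nBody\nArticle text.\n"
    "Load-Date: March 2, 2019\n"
    "End of Document\n"
    "Not an article\n"
)
docs, rejected = split_LN_text(sample)
print(len(docs), 'kept,', len(rejected), 'rejected')
```

Printing or saving the rejected chunks makes it possible to confirm that only non-articles are being dropped.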

#### 2017 articles

In [9]:
LN2017_docs = process_LN_year('../data/LN_data/raw/LN_2017')

Processing 12 files
processing ../data/LN_data/raw/LN_2017/LN47_2017_1.txt
processing ../data/LN_data/raw/LN_2017/LN47_2017_2.txt
processing ../data/LN_data/raw/LN_2017/LN47_2017_3.txt
processing ../data/LN_data/raw/LN_2017/LN47_2017_4.txt
processing ../data/LN_data/raw/LN_2017/LN812_2017_1.txt
processing ../data/LN_data/raw/LN_2017/LN812_2017_2.txt
processing ../data/LN_data/raw/LN_2017/LN812_2017_3.txt
processing ../data/LN_data/raw/LN_2017/LN812_2017_4.txt
processing ../data/LN_data/raw/LN_2017/LN812_2017_5.txt
processing ../data/LN_data/raw/LN_2017/LNpilot2_2017.txt
processing ../data/LN_data/raw/LN_2017/LNpilot3_2017.txt
processing ../data/LN_data/raw/LN_2017/LNpilot_2017.txt


In [10]:
len(LN2017_docs)

974

#### 2018 articles

In [11]:
LN2018_docs = process_LN_year('../data/LN_data/raw/LN_2018')

Processing 22 files
processing ../data/LN_data/raw/LN_2018/LN1012_2018_1.txt
processing ../data/LN_data/raw/LN_2018/LN1012_2018_2.txt
processing ../data/LN_data/raw/LN_2018/LN1012_2018_3.txt
processing ../data/LN_data/raw/LN_2018/LN1012_2018_4.txt
processing ../data/LN_data/raw/LN_2018/LN1012_2018_5.txt
processing ../data/LN_data/raw/LN_2018/LN1012_2018_6.txt
processing ../data/LN_data/raw/LN_2018/LN13_2018_1.txt
processing ../data/LN_data/raw/LN_2018/LN13_2018_2.txt
processing ../data/LN_data/raw/LN_2018/LN13_2018_3.txt
processing ../data/LN_data/raw/LN_2018/LN13_2018_4.txt
processing ../data/LN_data/raw/LN_2018/LN46_2018_1.txt
processing ../data/LN_data/raw/LN_2018/LN46_2018_2.txt
processing ../data/LN_data/raw/LN_2018/LN46_2018_3.txt
processing ../data/LN_data/raw/LN_2018/LN46_2018_4.txt
processing ../data/LN_data/raw/LN_2018/LN46_2018_5.txt
processing ../data/LN_data/raw/LN_2018/LN46_2018_6.txt
processing ../data/LN_data/raw/LN_2018/LN79_2018_1.txt
processing ../data/LN_data/raw/LN

In [12]:
len(LN2018_docs)

1915

#### 2019 articles

In [13]:
LN2019_docs = process_LN_year('../data/LN_data/raw/LN_2019')

Processing 34 files
processing ../data/LN_data/raw/LN_2019/LN1112_2019_1.txt
processing ../data/LN_data/raw/LN_2019/LN1112_2019_2.txt
processing ../data/LN_data/raw/LN_2019/LN1112_2019_3.txt
processing ../data/LN_data/raw/LN_2019/LN1112_2019_4.txt
processing ../data/LN_data/raw/LN_2019/LN1112_2019_5.txt
processing ../data/LN_data/raw/LN_2019/LN1112_2019_6.txt
processing ../data/LN_data/raw/LN_2019/LN13_2019_1.txt
processing ../data/LN_data/raw/LN_2019/LN13_2019_2.txt
processing ../data/LN_data/raw/LN_2019/LN13_2019_3.txt
processing ../data/LN_data/raw/LN_2019/LN13_2019_4.txt
processing ../data/LN_data/raw/LN_2019/LN13_2019_5.txt
processing ../data/LN_data/raw/LN_2019/LN13_2019_6.txt
processing ../data/LN_data/raw/LN_2019/LN46_2019_1.txt
processing ../data/LN_data/raw/LN_2019/LN46_2019_2.txt
processing ../data/LN_data/raw/LN_2019/LN46_2019_3.txt
processing ../data/LN_data/raw/LN_2019/LN46_2019_4.txt
processing ../data/LN_data/raw/LN_2019/LN46_2019_5.txt
processing ../data/LN_data/raw/LN

In [14]:
len(LN2019_docs)

3106

#### 2020 articles

In [15]:
LN2020_docs = process_LN_year('../data/LN_data/raw/LN_2020')

Processing 33 files
processing ../data/LN_data/raw/LN_2020/LN1112_2020_1.txt
processing ../data/LN_data/raw/LN_2020/LN1112_2020_2.txt
processing ../data/LN_data/raw/LN_2020/LN1112_2020_3.txt
processing ../data/LN_data/raw/LN_2020/LN12_2020_1.txt
processing ../data/LN_data/raw/LN_2020/LN12_2020_2.txt
processing ../data/LN_data/raw/LN_2020/LN12_2020_3.txt
processing ../data/LN_data/raw/LN_2020/LN12_2020_4.txt
processing ../data/LN_data/raw/LN_2020/LN12_2020_5.txt
processing ../data/LN_data/raw/LN_2020/LN12_2020_6.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_1.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_2.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_3.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_4.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_5.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_6.txt
processing ../data/LN_data/raw/LN_2020/LN35_2020_7.txt
processing ../data/LN_data/raw/LN_2020/LN67_2020_1.txt
processing ../data/LN_data/raw/LN_2020/

In [16]:
len(LN2020_docs)

3057

#### 2021 articles
* data collected from January 2021 through March 2021

In [17]:
LN2021_docs = process_LN_year('../data/LN_data/raw/LN_2021')

Processing 8 files
processing ../data/LN_data/raw/LN_2021/LN_2021_1.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_2.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_3.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_4.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_5.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_6.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_7.txt
processing ../data/LN_data/raw/LN_2021/LN_2021_8.txt


In [18]:
len(LN2021_docs)

771
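
The per-year counts reported above can be totaled to check them against the roughly 10,000 articles mentioned at the top of the notebook. The sketch hard-codes the counts from the `len(...)` outputs rather than using the `LN20XX_docs` lists, so that it runs on its own:

```python
# Article counts copied from the len(...) outputs in this notebook.
counts = {2017: 974, 2018: 1915, 2019: 3106, 2020: 3057, 2021: 771}

total = sum(counts.values())
print(total)  # 9823
```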

### 2. Write JSON files for each year
* taking the cleaned data, write each list as a JSON file

In [19]:
def write_LN_JSON(year, data):
    '''write the list of article dictionaries for a year to ../data/LN_data/cleaned/LN_{year}.json'''
    with open(f'../data/LN_data/cleaned/LN_{year}.json', 'w') as out:
        json.dump(data, out, indent=4)

In [20]:
write_LN_JSON(2017, LN2017_docs)

In [21]:
write_LN_JSON(2018, LN2018_docs)

In [22]:
write_LN_JSON(2019, LN2019_docs)
write_LN_JSON(2020, LN2020_docs)

In [23]:
write_LN_JSON(2021, LN2021_docs)
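
A quick way to confirm that a file was written correctly is to load it back and compare it with the original list. This sketch uses a temporary directory and a made-up record instead of the real `../data/LN_data/cleaned/` folder; `check_roundtrip` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile


def check_roundtrip(data, path):
    """Write data as JSON to path, read it back, and report whether it matches."""
    with open(path, 'w') as out:
        json.dump(data, out, indent=4)
    with open(path) as f:
        return json.load(f) == data


# Made-up record with the same keys as the cleaned articles.
example_docs = [{'title': 'Headline', 'source': 'Newspaper',
                 'date': 'March 1, 2019', 'body': 'Article text.'}]

with tempfile.TemporaryDirectory() as tmpdir:
    ok = check_roundtrip(example_docs, os.path.join(tmpdir, 'LN_example.json'))
print(ok)
```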