# Data Upload
The purpose of this notebook is to use a Python EDGAR library (`sec_edgar_downloader`) to request 10-Qs from the SEC.gov website for registered companies. I want to focus on the quarterly reports of companies listed on the NYSE; there are 2,800 companies in that list. Once I have the files, I will clean the documents up to prepare them for the modeling phase. The plan is to use 18 companies.

**About the data:** The data was obtained from a list of companies that were registered with the SEC. In a scrap notebook I created a dataframe out of that list and cleaned it up. I dropped three columns, which led to the loss of 30,000 companies. This was fine since the list contained 759,377 companies.

Once I had the cleaned dataframe, I then searched for NYSE Companies within the data and created a new dataset. That smaller dataset is what you see being uploaded to this notebook. A link for the original list of SEC Registered companies will be provided in the resource section below. 

## Objectives:

- Download needed documentation from SEC using edgar 
- Filter documentation to only contain NYSE company files 
- Turn documents into data frames for an easier cleaning process

# Bringing In the Data With Small Adjustments
In this section you will see the data frame created for 18 NYSE companies of my choice, each with its corresponding CIK number. You will then see the creation of an additional column that holds each company's ticker symbol.

In [2]:
# Data Importing and Manipulation
import numpy as np
import pandas as pd

# Web Scraping
from sec_edgar_downloader import Downloader
from bs4 import BeautifulSoup

# String Manipulation
import re
import unicodedata
from inscriptis import get_text
from cleantext import clean

# Cleaning up memory on computer after running code
import gc; gc.enable()

# Interact with the file systems
import os

In [3]:
# Importing the data
df = pd.read_csv('/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/CSV_Files/SEC Registered NYSE US Companies Down Sampled', index_col='Name')

In [4]:
df.columns # Checking the columns

Index(['Unnamed: 0', 'CIK_Num'], dtype='object')

In [5]:
df.drop(columns='Unnamed: 0', inplace=True) # Getting rid of extra column

In [6]:
# Creating a new column with each company's ticker symbol
df['Ticker'] = ['MMM', 'ABT', 'ACN', 'ALGT',
                'BKR', 'BAX', 'BA', 'DVN',
                'GS', 'GS', 'JNJ', 'MA',
                'MS', 'PM', 'PRU', 'RTX',
                'SPG', 'SKYW', 'SO']

In [7]:
# Checking work
print('Column Names:')
print(df.columns)
print('\n')
print('Data Quick View:')
df.head()

Column Names:
Index(['CIK_Num', 'Ticker'], dtype='object')


Data Quick View:


| Name | CIK_Num | Ticker |
|---|---|---|
| 3M Co | 66740 | MMM |
| Abbott Laboratories | 1800 | ABT |
| Accenture Plc | 1467373 | ACN |
| Allegiant Travel Co | 1362468 | ALGT |
| Baker Hughes Co | 1701605 | BKR |


# Company Document Downloads
This section contains the downloading of the documents that will be used for sentiment analysis. The document of focus is the 10-Q, or quarterly report, of each company in the dataset for the years 2016 - 2020. This section of code was obtained from Bryan Arnold. The documentation for the code used is located in the resource section of the notebook.

In [6]:
# Initialize a downloader instance.
# If no argument is passed to the constructor, the package
# will attempt to locate the user's downloads folder.
# I gave it the absolute path to my project folder

dl = Downloader("/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone")

**3M Co: MMM**

This is the first attempt at obtaining the data needed from the downloaded documents. Below you will see this conducted on one of the 10-Qs that was downloaded for 3M Co. If this proves to be successful, the next step will be an automated way to obtain the remaining text from each downloaded document.

In [8]:
# Get all 10-Q filings (ticker: MMM )
# dl.get("10-Q", "MMM", after_date="20160101")

In [70]:
# Reading in the document
PATH = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/MMM/10-Q/0001558370-16-005213.txt"
# The context manager closes the file automatically
with open(PATH, "r", encoding="utf-8") as file:
    text = file.read()

In [71]:
soup = BeautifulSoup(text, 'lxml') # Parsing document

In [1]:
# Viewing header of document 
# sec_header_tag = soup.find('sec-header')
# display(sec_header_tag)

### CLEANING OF THE MMM IMPORTED DOCUMENT

In [73]:
# Remove all script and style elements
for script in soup(["script", "style"]):
    script.extract()

In [74]:
# Assign what's left to a string
pageText = soup.body.get_text()

In [75]:
pageText = unicodedata.normalize("NFKD", pageText) # Normalizing text format
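To make the normalization step more concrete, here is a small sketch on a made-up string (not taken from a filing). It assumes the filings contain compatibility characters such as non-breaking spaces or ligatures, which is common in HTML-derived text but has not been verified for these documents. The names `sample` and `normalized` are only for illustration.

```python
import unicodedata

# Hypothetical sample: a non-breaking space (U+00A0) and an "fi" ligature (U+FB01)
sample = "net\u00a0sales of \ufb01nancial products"

# NFKD replaces compatibility characters with their plain equivalents
normalized = unicodedata.normalize("NFKD", sample)

print(repr(sample))
print(repr(normalized))
```

After normalization the string contains only ordinary spaces and letters, which makes later matching and tokenizing more predictable.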

In [76]:
# Removing punctuation characters from the document (hyphens are kept)
pageText = "".join(c for c in pageText if c not in '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~') 
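The character-by-character join above works, but on long filings a translation table may be faster. The following is a minimal sketch of the same idea using `str.translate`; the sample sentence is made up, and `chars_to_remove` and `table` are illustrative names.

```python
import string

# Same character set as the cell above: all ASCII punctuation except the hyphen
chars_to_remove = string.punctuation.replace("-", "")

# Build the translation table once; every listed character is deleted
table = str.maketrans("", "", chars_to_remove)

# Hypothetical sample text, not taken from a filing
sample = "Item 1A. Risk Factors (year-over-year): $1,234!"
stripped = sample.translate(table)

print(stripped)
```

Hyphenated terms such as "year-over-year" survive, matching the behavior of the original cell.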

### RISK FACTOR SECTION OF DOCUMENT

In [77]:
# Trying to print the text of the Risk Factor section
start = 'ITEM 1A'
end = 'ITEM 2'
result = re.search('%s(.*)%s' % (start, end), pageText)

print(result)

None


# Quick Observation 
As you can see from the cell above, the search found no sections in the text labeled "Item 1A" or "Item 2". My assumption is that each company files its documents in a different style or order. So trying to pull out a certain section will be a long-winded task; something I will have to try when there isn't a deadline to meet.
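Another possible cause of the `None` result, besides differences in filing style, is the pattern itself: `re.search` is case-sensitive by default, and `.` does not match line breaks unless `re.DOTALL` is set. The sketch below compares the original pattern with a relaxed one on a made-up stand-in for `pageText`; whether this actually recovers the section in the real 3M filing is untested.

```python
import re

# Hypothetical stand-in for pageText; real filings are far longer
sample_text = (
    "Part II Other Information\n"
    "Item 1A Risk Factors\n"
    "Our results may be affected by economic conditions\n"
    "Item 2 Unregistered Sales of Equity Securities\n"
)

start = "ITEM 1A"
end = "ITEM 2"

# Original form: case-sensitive, and "." stops at line breaks
strict = re.search("%s(.*)%s" % (start, end), sample_text)

# Relaxed form: ignore case, let "." span lines, and match lazily
pattern = re.compile(
    re.escape(start) + r"(.*?)" + re.escape(end),
    flags=re.IGNORECASE | re.DOTALL,
)
relaxed = pattern.search(sample_text)

print(strict)
print(relaxed.group(1).strip() if relaxed else None)
```

In a real 10-Q the first hit may land in the table of contents rather than the section body, so the match would still need to be checked by eye.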

## Next Steps
Below you will see the code for downloading each company's 10-Qs between 2016-2020.

# Natural Language Processing Steps
Since we are working with multiple documents for each company, we will be using TF-IDF. The first thing we will do is create a document path for each company. Next, we will turn each company's text files into a dataframe; the df will make it easy to clean up the documents by removing stop words and punctuation and turning all uppercase letters to lowercase.
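For readers unfamiliar with TF-IDF, here is a minimal pure-Python sketch of one common textbook variant (term frequency times the log of inverse document frequency, with no smoothing). The three documents are made up, and libraries such as scikit-learn use a slightly different, smoothed formula, so the numbers here will not match the modeling notebook.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents (one string per filing)
docs = {
    "filing_a": "revenue increased revenue growth",
    "filing_b": "revenue decreased litigation risk",
    "filing_c": "litigation risk increased",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))

def tf_idf(term, tokens):
    """Term frequency times inverse document frequency (unsmoothed variant)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

scores = {
    name: {term: round(tf_idf(term, tokens), 3) for term in set(tokens)}
    for name, tokens in tokenized.items()
}

for name, term_scores in scores.items():
    print(name, term_scores)
```

A term that appears in only one filing ("growth") scores higher than a term shared across filings ("revenue"), even though "revenue" occurs more often in that document. That is the property that makes TF-IDF useful when comparing many filings.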

# 3M Co: MMM

In [10]:
# Creating the path to list of 10-Qs
MMM_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/MMM/10-Q/"

# Abbott Laboratories: ABT

In [17]:
# Get all 10-Q filings (ticker: ABT )
# dl.get("10-Q", "ABT", after_date="20160101")

In [11]:
# Creating the path to list of 10-Qs
ABT_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/ABT/10-Q/" 

# Accenture Plc: ACN

In [19]:
# Get all 10-Q filings (ticker: ACN )
# dl.get("10-Q", "ACN", after_date="20160101")

In [12]:
# Creating the path to list of 10-Qs
ACN_PATH = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/ACN/10-Q/"

# Allegiant Travel Co: ALGT

In [21]:
# Get all 10-Q filings (ticker: ALGT )
# dl.get("10-Q", "ALGT", after_date="20160101")

In [13]:
# Creating the path to list of 10-Qs
ALGT_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/ALGT/10-Q/"

# Baker Hughes Co: BKR

In [23]:
# Get all 10-Q filings (ticker: BKR )
# dl.get("10-Q", "BKR", after_date="20160101")

In [14]:
# Creating the path to list of 10-Qs
BKR_Path =  "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/BKR/10-Q/"

# Baxter International Inc: BAX

In [25]:
# Get all 10-Q filings (ticker: BAX )
# dl.get("10-Q", "BAX", after_date="20160101")

In [15]:
# Creating the path to list of 10-Qs
BAX_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/BAX/10-Q/"   

# Boeing Co: BA

In [27]:
# Get all 10-Q filings (ticker: BA )
# dl.get("10-Q", "BA", after_date="20160101")

In [16]:
# Creating the path to list of 10-Qs
BA_Path =  "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/BA/10-Q/"

# Devon Energy Corp: DVN

In [29]:
# Get all 10-Q filings (ticker: DVN )
# dl.get("10-Q", "DVN", after_date="20160101")

In [17]:
# Creating the path to list of 10-Qs
DVN_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/DVN/10-Q/"

# Goldman Sachs Group Inc: GS

In [31]:
# Get all 10-Q filings (ticker: GS )
# dl.get("10-Q", "GS", after_date="20160101")

In [18]:
# Creating the path to list of 10-Qs
GS_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/GS/10-Q/"

# Johnson & Johnson: JNJ

In [33]:
# Get all 10-Q filings (ticker: JNJ )
# dl.get("10-Q", "JNJ", after_date="20160101")

In [19]:
# Creating the path to list of 10-Qs
JNJ_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/JNJ/10-Q/"

# Mastercard Inc: MA

In [35]:
# Get all 10-Q filings (ticker: MA )
# dl.get("10-Q", "MA", after_date="20160101")

In [20]:
# Creating the path to list of 10-Qs
MA_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/MA/10-Q/" 

# Morgan Stanley: MS

In [37]:
# Get all 10-Q filings (ticker: MS )
# dl.get("10-Q", "MS", after_date="20160101")

In [21]:
# Creating the path to list of 10-Qs
MS_Path =  "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/MS/10-Q/"  

# Philip Morris International Inc.: PM

In [39]:
# Get all 10-Q filings (ticker: PM )
# dl.get("10-Q", "PM", after_date="20160101")

In [22]:
# Creating the path to list of 10-Qs
PM_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/PM/10-Q/"  

# Prudential Financial Inc: PRU

In [41]:
# Get all 10-Q filings (ticker: PRU )
# dl.get("10-Q", "PRU", after_date="20160101")

In [23]:
# Creating the path to list of 10-Qs
PRU_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/PRU/10-Q/"

# Raytheon Technologies Corp: RTX

In [43]:
# Get all 10-Q filings (ticker: RTX )
# dl.get("10-Q", "RTX", after_date="20160101")

In [24]:
# Creating the path to list of 10-Qs
RTX_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/RTX/10-Q/"

# Simon Property Group Inc: SPG

In [45]:
# Get all 10-Q filings (ticker: SPG)
# dl.get("10-Q", "SPG", after_date="20160101")

In [25]:
# Creating the path to list of 10-Qs
SPG_Path =  "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/SPG/10-Q/"   

# Skywest Inc: SKYW

In [47]:
# Get all 10-Q filings (ticker: SKYW )
# dl.get("10-Q", "SKYW", after_date="20160101")

In [26]:
# Creating the path to list of 10-Qs
SKYW_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/SKYW/10-Q/"

# Southern Co: SO

In [49]:
# Get all 10-Q filings (ticker: SO )
# dl.get("10-Q", "SO", after_date="20160101")

In [27]:
# Creating the path to list of 10-Qs
SO_Path = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings/SO/10-Q/"

## Recap Objectives:

- Download needed documentation from SEC using edgar **Complete**
- Filter documentation to only contain NYSE company files  **Complete**
- Turn documents into data frames for an easier cleaning process

# Creating The Data Frames
The paths to the folders holding the documents needed for sentiment analysis are placed in a list below. That list is then passed to a function that turns each folder of documents into a data frame.

In [28]:
doc_path_list = [MMM_Path, ABT_Path, ACN_PATH,
                 ALGT_Path, BKR_Path, BAX_Path,
                 BA_Path, DVN_Path, GS_Path,
                 JNJ_Path, MA_Path, MS_Path, 
                 PM_Path, PRU_Path, RTX_Path,
                 SPG_Path, SKYW_Path, SO_Path]
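Since every path above follows the same pattern, the list could also be generated from the tickers rather than typed out one variable at a time. This is only a sketch: `BASE_DIR`, `filing_dir`, and `doc_paths` are hypothetical names, and the folder layout is assumed to be the one shown in the cells above.

```python
import os

# Base folder as used in the cells above
BASE_DIR = "/Users/boimoriba/Documents/Learn.Co_Docs/Projects/Capstone/QQuarterlyInc/sec_edgar_filings"

# Tickers in the same order as doc_path_list
tickers = ["MMM", "ABT", "ACN", "ALGT", "BKR", "BAX",
           "BA", "DVN", "GS", "JNJ", "MA", "MS",
           "PM", "PRU", "RTX", "SPG", "SKYW", "SO"]

def filing_dir(ticker, form="10-Q", base_dir=BASE_DIR):
    """Return the folder for one ticker, with a trailing separator
    so it can be concatenated with a file name."""
    return os.path.join(base_dir, ticker, form, "")

# One entry per ticker; the values can be passed on as a list of paths
doc_paths = {ticker: filing_dir(ticker) for ticker in tickers}

print(len(doc_paths))
print(doc_paths["MMM"])
```

Keeping the paths in a dictionary keyed by ticker also makes it easy to run the conversion on a single company first, for example with `[doc_paths["MMM"]]`.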

**The link for the code below is located in the reference section, as well as individuals that assisted in the coding process: Lindsey & Bryan.**

In [58]:
def df_conversion(path_list):
    '''Take a list of paths to folders holding 10-Q documents and return a
    dataframe of the cleaned text with its corresponding file name.

    For each .txt file the function reads in and parses the document,
    assigns what is left to a string, normalizes the text format, and gets
    rid of characters and digits. `del` and gc.collect() are called after
    each file to free up memory on the computer.'''
    # Initialize dictionary
    file_name_and_text = {}

    # Loop over every folder; the file loop is nested inside so that
    # all folders are processed, not only the last one in the list
    for path in path_list:
        # Extract filenames
        file_names = os.listdir(path)

        # Parse text for each file
        for file in file_names:
            if file.endswith('.txt'):
                # The with-block closes the file automatically
                with open(os.path.join(path, file), "r", encoding="windows-1252", errors='ignore') as target_file:
                    text = target_file.read()

                text = get_text(text)
                text = unicodedata.normalize("NFKD", text)
                text = clean(text,
                             all=True,
                             extra_spaces=True,
                             stemming=True,
                             stopwords=True,
                             lowercase=True,
                             numbers=True,
                             punct=True,
                             stp_lang='english')

                # Store text in dictionary
                file_name_and_text[file] = text
                del text; gc.collect()

    # Cast as dataframe
    file_data = (pd.DataFrame.from_dict(file_name_and_text, orient='index')
                 .reset_index()
                 .rename(index = str, columns = {'index': 'file_name', 0: 'text'}))

    return file_data

In [59]:
df = df_conversion(doc_path_list)

## Recap Objectives:

- Download needed documentation from SEC using edgar **Complete**
- Filter documentation to only contain NYSE company files  **Complete**
- Turn documents into data frames for an easier cleaning process **Complete**

# Conclusion
Now that the df is made, the next step is to save it to a file for easier access and further cleaning of the text (the commented-out cell below uses the Feather format). The last bits of cleaning and the modeling will be done in the next notebook. Be advised: if you use this function, try it on one company first!

In [118]:
# Saving in 'Feather' format
# df.reset_index().to_feather('10-Qs.feather')

# Resources 

**The Idea**

The Blog That Led to This Project: https://towardsdatascience.com/useful-sentiment-analysis-mining-sec-filings-part-1-358942fc98ed

**Libraries**

Downloading 10-Q: https://sec-edgar-downloader.readthedocs.io/en/latest/

Edgar Documentation: https://pypi.org/project/edgar/

OS: https://www.geeksforgeeks.org/os-module-python-examples/

Inscriptis: https://pypi.org/project/inscriptis/

Cleantext: https://pypi.org/project/cleantext/

**Companies** 

List of SEC Registered Companies: https://www.sec.gov/Archives/edgar/cik-lookup-data.txt

**Code**

Function for text to df: https://stackoverflow.com/questions/33912773/python-read-txt-files-into-a-dataframe



# Human Resources

Bryan Arnold 02172020 DS Lead Instructor: https://www.linkedin.com/in/bryan-arnold-mathematics/

Lindsey Berlin 02172020 DS Coach: https://www.linkedin.com/in/lindseyberlin/