# Data module class 2
Reading documentation: Pandas and BeautifulSoup

In [4]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [13]:
# install BeautifulSoup if you need to
# !pip install beautifulsoup4

## Pandas
### Terminology reference
#### Data structures
##### 1-dimensional data (create Series)

|pandas abbreviation|definition|example|
|---|---|---|
|dict|Python dictionary|`{'a': 'value', 'b': 'value'}`|
|ndarray|N-dimensional NumPy array (here 1-dimensional)|`np.array([0, 1, 2, 3])`|
|scalar|Single value|`100`|
|list|Python list|`[0, 1, 2, 3]`|

##### 2-dimensional data (create DataFrames)

|pandas term|example|
|---|---|
|ndarray|`np.array([[0, 1, 2, 3], [4, 5, 6, 7]])`|
|dict of ndarrays or lists|`{'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}`|
|list of dicts|`[{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}]`|

#### How do these look when loaded in pandas?
[Taken from the Pandas User Guide](https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [5]:
pd.Series({'a': 'value', 'b': 'value', 'c': 'value'})

a    value
b    value
c    value
dtype: object

In [6]:
pd.Series([0, 1, 2, 3, 4, 5])

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [7]:
pd.Series("Attilla")
pd.Series("Five")
pd.Series(5.25)

0    5.25
dtype: float64

In [8]:
pd.DataFrame([{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}, {'id': 3, 'info': 'even more text'}])

Unnamed: 0,id,info
0,1,text
1,2,more text
2,3,even more text
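
The cells above only use dicts, lists and scalars. As a minimal sketch (the variable names here are made up for illustration), this is how the ndarray rows of the tables might look when loaded into pandas:

```python
import numpy as np
import pandas as pd

# A 1-dimensional ndarray becomes a Series with a default integer index
arr_1d = np.array([0, 1, 2, 3])
s = pd.Series(arr_1d)

# A 2-dimensional ndarray becomes a DataFrame; naming the columns is optional
arr_2d = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
df_from_array = pd.DataFrame(arr_2d, columns=['a', 'b', 'c', 'd'])

# A dict of equal-length lists (or ndarrays) maps each key to a column
df_from_dict = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})

print(s.shape, df_from_array.shape, df_from_dict.shape)
```

Each row of the 2-dimensional array becomes a row of the DataFrame, while each key of the dict becomes a column.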


#### Other terms
[See pd.to_datetime() as an example](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime)

#### parameters: Information that a function accepts 
- args
    - Positional arguments: values identified by their position in the call. These are often the required inputs (the things the function needs in order to run)
    - e.g. the data for your DataFrame
- kwargs (even though Pandas does not identify them as such)
    - Keyword arguments: usually optional arguments that have a default value. The function runs without them, but passing one changes its behavior from the default. Called "keyword" arguments because you pass them by name
    - e.g. errors='raise'
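
A small sketch of the difference, using `pd.to_datetime()` with a made-up list of strings. The list is passed by position, and `errors` is passed by keyword to change the default behavior:

```python
import pandas as pd

# Made-up input: one valid date and one value that cannot be parsed
dates = ['2021-03-05', 'not a date']

# Default behavior (errors='raise'): an unparseable value raises an exception
try:
    pd.to_datetime(dates)
    raised = False
except ValueError:
    raised = True

# The keyword argument errors='coerce' turns unparseable values into NaT
coerced = pd.to_datetime(dates, errors='coerce')

print(raised)
print(coerced)
```

The first argument is required, so the call fails without it. Leaving out `errors` is fine because pandas falls back to its default.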

In [9]:
df = pd.DataFrame({'year': [2020, 2021],
                   'month': [6, 3],
                   'day': [4, 5]})
#pd.to_datetime(df)
#0   2020-06-04
#1   2021-03-05

In [10]:
pd.to_datetime(df)

0   2020-06-04
1   2021-03-05
dtype: datetime64[ns]

### 1. Let's practice input/output with Pandas with the following links.
Use the [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) section of the pandas documentation to load these datasets:

- [Marvel Wikia data (loaded below as `df_Avengers`) - FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv) | [Documentation here](https://github.com/fivethirtyeight/data/tree/master/comic-characters)
- [List of sovereign states - Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states)
- [Homeless housing - LA Times](https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv) | [Documentation](https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal)
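
`pd.read_csv()` takes a URL, a local path, or any file-like object. The sketch below uses a made-up CSV string wrapped in `io.StringIO` as a stand-in for a downloaded file, so it runs without a network connection:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file behind a URL
csv_text = "name,year\nAlpha,1962\nBeta,\n"

# read_csv parses the header row and infers a type for each column
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)
print(df_demo.dtypes)
```

The empty `year` cell is read as a missing value (`NaN`), which is why a column of whole numbers can show up as floats.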

In [11]:
df_Avengers = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')

In [198]:
df_Avengers.head(3)

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
0,1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
1,7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
2,64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0


In [13]:
url = ('https://en.wikipedia.org/wiki/List_of_sovereign_states')


In [14]:
df_sovereigns = pd.read_html(url)

In [92]:
df_sovereigns = df_sovereigns[0]

In [95]:
df_sovereigns.columns


Index(['Common and formal names', 'Membership within the UN System[a]',
       'Sovereignty dispute[b]',
       'Further information on status and recognition of sovereignty[d]'],
      dtype='object')

In [101]:
df_sovereigns.isnull()

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,True,True,True,True
1,False,True,True,True
2,False,False,False,False
3,False,False,True,True
4,False,True,True,True
...,...,...,...,...
237,False,True,False,False
238,False,False,False,False
239,False,True,False,False
240,True,True,True,True


In [23]:
df_LA = pd.read_csv('https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv')

### 2. Let's practice working with missing data and selecting these values
#### For each DataFrame, either select all the missing values of one column or select a unique categorical value.
The [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) section of the pandas documentation will help.
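
As a minimal sketch of both options on a made-up DataFrame (the names and values are invented): a boolean mask from `isna()` selects the rows with missing values, and a mask from `==` selects one categorical value.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing eye colors
df_demo = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                        'EYE': ['Blue Eyes', np.nan, 'Hazel Eyes', np.nan]})

# Rows where EYE is missing
missing_eye = df_demo[df_demo['EYE'].isna()]

# Rows where EYE equals one specific category
hazel = df_demo[df_demo['EYE'] == 'Hazel Eyes']

print(len(missing_eye), len(hazel))
```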

In [36]:
df_Avengers.columns

Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX',
       'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'Year'],
      dtype='object')

In [75]:
#df_Avengers['EYE'].unique()
df_hazel = df_Avengers['EYE']=='Hazel Eyes'
df_hazel.value_counts()


False    16300
True        76
Name: EYE, dtype: int64

In [106]:
df_Avengers['HAIR'].unique()
#df_Avengers['HAIR'].isna()

array(['Brown Hair', 'White Hair', 'Black Hair', 'Blond Hair', 'No Hair',
       'Blue Hair', 'Red Hair', 'Bald', 'Auburn Hair', 'Grey Hair',
       'Silver Hair', 'Purple Hair', 'Strawberry Blond Hair',
       'Green Hair', 'Reddish Blond Hair', 'Gold Hair', nan,
       'Orange Hair', 'Pink Hair', 'Variable Hair', 'Yellow Hair',
       'Light Brown Hair', 'Magenta Hair', 'Bronze Hair', 'Dyed Hair',
       'Orange-brown Hair'], dtype=object)

In [107]:
df_Avengers['ID'].unique()

array(['Secret Identity', 'Public Identity', 'No Dual Identity',
       'Known to Authorities Identity', nan], dtype=object)

In [111]:
df_Avengers[df_Avengers['ID'].isna()].shape

(3770, 13)

In [28]:
#No nan from status
df_LA['status'].isna().head(5)

0    False
1    False
2    False
3    False
4    False
Name: status, dtype: bool

In [200]:
df_LA.isna()
#all clean, no missing values?

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
74,False,False,False,False,False,False,False,False,False
75,False,False,False,False,False,False,False,False,False
76,False,False,False,False,False,False,False,False,False
77,False,False,False,False,False,False,False,False,False


#### a. Avengers

In [203]:
df_Avengers['EYE'].isnull().value_counts()

True     9767
False    6609
Name: EYE, dtype: int64

#### b. Countries

In [102]:
df_sovereigns.isnull()

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,True,True,True,True
1,False,True,True,True
2,False,False,False,False
3,False,False,True,True
4,False,True,True,True
...,...,...,...,...
237,False,True,False,False
238,False,False,False,False
239,False,True,False,False
240,True,True,True,True


In [115]:
#df_sovereigns.info
df_sovereigns[df_sovereigns['Common and formal names'].isna()]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
227,,,,
228,,,,
240,,,,
241,,,,


#### c. LA homeless housing

In [160]:
df_LA.info()
#df_LA[df_LA['address'].isna()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   project_name  79 non-null     object 
 1   address       79 non-null     object 
 2   district_no   79 non-null     int64  
 3   units         79 non-null     int64  
 4   sh_units      79 non-null     int64  
 5   status        79 non-null     object 
 6   lon           79 non-null     float64
 7   lat           79 non-null     float64
 8   geoAddress    79 non-null     object 
dtypes: float64(2), int64(3), object(4)
memory usage: 5.7+ KB


In [126]:
#df_LA['status'].unique()
df_LA[df_LA['status']!='Pending City Council approval'].head(5)
#len(df_LA[df_LA['status']!='Pending City Council approval'])

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
0,Reseda Theater Senior Housing (Canby Woods West),7221 N CANBY AVE CA 91335,3,26,13,Already approved,-118.535105,34.201798,"7221 canby ave, reseda, ca 91335, usa"
1,Main Street Apartments,5501 S MAIN ST CA 90037,9,57,56,Already approved,-118.274276,33.992203,"5501 s main st, los angeles, ca 90037, usa"
2,Berendo Sage,1035 S BERENDO ST CA 90006,1,42,21,Already approved,-118.294014,34.051678,"1035 s berendo st, los angeles, ca 90006, usa"
3,South Main Street Apartments,12003 S MAIN ST CA 90061,15,62,61,Already approved,-118.27425,33.923439,"12003 s main st, los angeles, ca 90061, usa"
4,Montecito II Senior Housing,6668 W FRANKLIN AVE HOLLYWOOD CA 90028,13,64,32,Already approved,-118.335282,34.105027,"6668 franklin ave, los angeles, ca 90028, usa"


### 3. Let's practice cleaning with intent

#### For each of the three datasets loaded above, come up with a question you want to answer with the data
##### Tips
- Show the column list, the column types, and the null values
- Find unique values to look at categorical data

#### a. Avengers
##### Question
- How many characters have green eyes and blond hair? 

##### What steps do I need to do to answer the question?
- Look at the frame info for the basics, call `unique()` to see how the file words eye and hair colors, and check for missing values.
- Do value counts for green eyes and for blond hair
- Combine the two conditions with the "&" operator to get the characters with both traits
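
The steps above can be sketched on a made-up DataFrame (invented rows, same column names as the dataset). Each comparison gives a boolean Series, and `&` keeps only the rows where both are True:

```python
import numpy as np
import pandas as pd

# Made-up rows; the missing EYE value compares as False
df_demo = pd.DataFrame({
    'EYE': ['Green Eyes', 'Green Eyes', 'Blue Eyes', np.nan],
    'HAIR': ['Blond Hair', 'Black Hair', 'Blond Hair', 'Blond Hair'],
})

green = df_demo['EYE'] == 'Green Eyes'
blond = df_demo['HAIR'] == 'Blond Hair'

# Parentheses matter when the comparisons are written inline,
# because & binds more tightly than ==
both = green & blond

# True counts as 1, so sum() gives the number of matching rows
print(both.sum())
```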

In [171]:
# show the dataframe info here to get you started 
#df_Avengers.info()
#df_Avengers['EYE'].unique()
#df_Avengers['EYE'].isna()
#df_Avengers['HAIR'].unique()
#df_green = df_Avengers['EYE']=='Green Eyes'
#df_green.value_counts()
df_blond = df_Avengers['HAIR']=='Blond Hair'
df_blond.value_counts()

False    14794
True      1582
Name: HAIR, dtype: int64

In [172]:
len(df_Avengers[df_Avengers['HAIR']=='Blond Hair'])

1582

In [143]:
df_both=(df_Avengers['EYE']=='Green Eyes') & (df_Avengers['HAIR']=='Blond Hair')
df_both.value_counts()

False    16318
True        58
dtype: int64

In [147]:
#df[df['Location'].str.contains(", NY", na=False)]
#for avenger in df_Avengers:
#    if df_Avengers['EYE']=='Green Eyes':
#        print(df_Avengers['name'])
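
The commented-out loop would not work as written: iterating over a DataFrame yields its column names, and an `if` on a whole boolean Series raises a `ValueError`. A sketch of a loop-free alternative on made-up data, using `.loc` with a boolean mask:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame({
    'name': ['Hero One', 'Hero Two', 'Hero Three'],
    'EYE': ['Green Eyes', np.nan, 'Green Eyes'],
})

# .loc[row_mask, column] returns the names for the matching rows only
green_names = df_demo.loc[df_demo['EYE'] == 'Green Eyes', 'name']

# str.contains matches substrings; na=False treats missing values as no match
any_green = df_demo[df_demo['EYE'].str.contains('Green', na=False)]

print(green_names.tolist())
```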

#### b. Countries
##### Question
- How many countries are in sovereignty disputes?

##### What cleaning do I need to do to answer the question?
- Get exact column names
- Look for missing data? Is this necessary?
- Do length counts
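
One way to turn these steps into a count, sketched on made-up data. It assumes that a state without a dispute is labelled with the text `'None'` and that separator rows are missing (`NaN`); check `unique()` on the real column before relying on either assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows; the 'None' label and the column names are assumptions
df_demo = pd.DataFrame({
    'state': ['State A', 'State B', 'State C', np.nan],
    'dispute': ['None', 'Claimed by State C', 'None', np.nan],
})
col = df_demo['dispute']

# value_counts(dropna=False) shows every distinct value, including NaN
print(col.value_counts(dropna=False))

# Count rows that are neither missing nor labelled 'None'
in_dispute = col.notna() & (col != 'None')
print(in_dispute.sum())
```

Note that `len(col.unique())` counts distinct values rather than rows, so it answers a different question.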

In [162]:
df_sovereigns.columns
#df_sovereigns.info

Index(['Common and formal names', 'Membership within the UN System[a]',
       'Sovereignty dispute[b]',
       'Further information on status and recognition of sovereignty[d]'],
      dtype='object')

In [158]:
#df_sovereigns[df_sovereigns['Sovereignty dispute[b]'].isna()].count()
df_sovereigns['Sovereignty dispute[b]'].isna()

0       True
1       True
2      False
3       True
4       True
       ...  
237    False
238    False
239    False
240     True
241     True
Name: Sovereignty dispute[b], Length: 242, dtype: bool

In [187]:
len(df_sovereigns['Sovereignty dispute[b]'].unique())

46

#### c. LA homeless housing
##### Question
- How many shelters are approved?

##### What cleaning do I need to do to answer the question?
- Get column names
- Get uniques
- Count
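
A sketch of the counting step on made-up rows that use the two status labels from the dataset. `value_counts()` gives the count for every category at once, which saves a separate filter per category:

```python
import pandas as pd

# Made-up rows with the two status labels
df_demo = pd.DataFrame({'status': ['Already approved',
                                   'Pending City Council approval',
                                   'Already approved']})

# Count of rows per category
counts = df_demo['status'].value_counts()

# Or count a single category directly with a boolean mask
approved = (df_demo['status'] == 'Already approved').sum()

print(counts)
print(approved)
```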

In [190]:
df_LA.info()
#df_LA[df_LA['address'].isna()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   project_name  79 non-null     object 
 1   address       79 non-null     object 
 2   district_no   79 non-null     int64  
 3   units         79 non-null     int64  
 4   sh_units      79 non-null     int64  
 5   status        79 non-null     object 
 6   lon           79 non-null     float64
 7   lat           79 non-null     float64
 8   geoAddress    79 non-null     object 
dtypes: float64(2), int64(3), object(4)
memory usage: 5.7+ KB


In [193]:
df_LA['status'].unique()

array(['Already approved', 'Pending City Council approval'], dtype=object)

In [197]:
#df_LA[df_LA['status']!='Pending City Council approval'].head(5)
len(df_LA[df_LA['status']!='Pending City Council approval'])

55

In [205]:
df_LA['project_name'][0:5]

0    Reseda Theater Senior Housing (Canby Woods West)
1                              Main Street Apartments
2                                        Berendo Sage
3                        South Main Street Apartments
4                         Montecito II Senior Housing
Name: project_name, dtype: object

Take a look at the [LA Times'](https://github.com/datadesk/notebooks) or [FiveThirtyEight's](https://github.com/fivethirtyeight/data) repositories for more practice.

## BeautifulSoup
[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [213]:
# load in the HTML and format for BS
sp_wiki_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

In [212]:
sp_r = requests.get(sp_wiki_url)


In [221]:
#sp_r.content
# name the parser explicitly so BeautifulSoup does not have to guess one
sp_bs = BeautifulSoup(sp_r.content, 'html.parser')


In [222]:
# find the title tag
sp_bs.title

<title>List of S&amp;P 500 companies - Wikipedia</title>

In [226]:
# grab the first a tag (sp_bs.a is shorthand for the same thing)
#sp_bs.a
sp_bs.find('a')

<a id="top"></a>

In [228]:
# finds all a tags
#sp_bs.find_all('a')
len(sp_bs.find_all('a'))

3562

In [229]:
# find all elements with the class "mw-jump-link"
sp_bs.find_all(class_='mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]
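
The same calls can be tried offline on a small HTML string. This is a sketch with made-up markup (the `jump` class and the links are invented), not the Wikipedia page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<html><head><title>Demo page</title></head>
<body>
  <a id="top"></a>
  <a class="jump" href="#nav">Jump to navigation</a>
  <a class="jump" href="#search">Jump to search</a>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.text)          # text inside the title tag
print(soup.find('a'))           # first a tag only
print(len(soup.find_all('a')))  # find_all returns every match

# class_ has a trailing underscore because class is a reserved word in Python
hrefs = [a['href'] for a in soup.find_all(class_='jump')]
print(hrefs)
```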

#### Format the first table of the list of S&P 500 companies wiki page as a dataframe

[Traversing the DOM - W3C](https://www.w3.org/wiki/Traversing_the_DOM)

In [242]:
# find where the data you want resides (a tag, class name, etc)
# HTML is a tree of tags nested within tags, so we walk through the DOM tree
sp_tables = sp_bs.find_all('table')

### We can do more cleaning here

In [243]:
len(sp_tables)

2

In [244]:
sp_tables=sp_tables[0]

In [245]:
sp_trs = sp_tables.find_all('tr')

In [249]:
#header row
sp_trs[0]

<tr>
<th><a href="/wiki/Ticker_symbol" title="Ticker symbol">Symbol</a>
</th>
<th>Security</th>
<th><a href="/wiki/SEC_filing" title="SEC filing">SEC filings</a></th>
<th><a href="/wiki/Global_Industry_Classification_Standard" title="Global Industry Classification Standard">GICS</a> Sector</th>
<th>GICS Sub-Industry</th>
<th>Headquarters Location</th>
<th>Date first added</th>
<th><a href="/wiki/Central_Index_Key" title="Central Index Key">CIK</a></th>
<th>Founded
</th></tr>

In [250]:
sp_list = []
for tr in sp_trs[0:1]:
    print(tr)
    

<tr>
<th><a href="/wiki/Ticker_symbol" title="Ticker symbol">Symbol</a>
</th>
<th>Security</th>
<th><a href="/wiki/SEC_filing" title="SEC filing">SEC filings</a></th>
<th><a href="/wiki/Global_Industry_Classification_Standard" title="Global Industry Classification Standard">GICS</a> Sector</th>
<th>GICS Sub-Industry</th>
<th>Headquarters Location</th>
<th>Date first added</th>
<th><a href="/wiki/Central_Index_Key" title="Central Index Key">CIK</a></th>
<th>Founded
</th></tr>


In [251]:
# double loop: loop through all the trs, then through the tds in each tr
# start at index 1 to drop the header row
sp_list = []
for tr in sp_trs[1:]:
    tds = tr.find_all('td')
    tr_list = []
    for td in tds:
        tr_list.append(td.text)
    sp_list.append(tr_list)
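
A variation on the double loop, sketched on a made-up two-row table. It uses `get_text(strip=True)` instead of `.text` to drop trailing newlines, and reads the column names from the `th` cells of the header row:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table with the same kind of stray newlines as the real page
html = """
<table>
<tr><th>Symbol
</th><th>Security</th></tr>
<tr><td>MMM
</td><td>3M</td></tr>
<tr><td>ABT
</td><td>Abbott Laboratories</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')
rows = table.find_all('tr')

# Header cells are th tags; data cells are td tags
header = [th.get_text(strip=True) for th in rows[0].find_all('th')]
data = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in rows[1:]]

df_demo = pd.DataFrame(data, columns=header)
print(df_demo)
```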

In [252]:
#pulls data in an organized way that pandas accepts
sp_list 

[['MMM\n',
  '3M',
  'reports',
  'Industrials',
  'Industrial Conglomerates',
  'Saint Paul, Minnesota',
  '1976-08-09',
  '0000066740',
  '1902\n'],
 ['ABT\n',
  'Abbott Laboratories',
  'reports',
  'Health Care',
  'Health Care Equipment',
  'North Chicago, Illinois',
  '1964-03-31',
  '0000001800',
  '1888\n'],
 ['ABBV\n',
  'AbbVie',
  'reports',
  'Health Care',
  'Pharmaceuticals',
  'North Chicago, Illinois',
  '2012-12-31',
  '0001551152',
  '2013 (1888)\n'],
 ['ABMD\n',
  'Abiomed',
  'reports',
  'Health Care',
  'Health Care Equipment',
  'Danvers, Massachusetts',
  '2018-05-31',
  '0000815094',
  '1981\n'],
 ['ACN\n',
  'Accenture',
  'reports',
  'Information Technology',
  'IT Consulting & Other Services',
  'Dublin, Ireland',
  '2011-07-06',
  '0001467373',
  '1989\n'],
 ['ATVI\n',
  'Activision Blizzard',
  'reports',
  'Communication Services',
  'Interactive Home Entertainment',
  'Santa Monica, California',
  '2015-08-31',
  '0000718877',
  '2008\n'],
 ['ADBE\n',
 

In [254]:
sp_df = pd.DataFrame(sp_list)

In [256]:
sp_df.head(3)


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,MMM\n,3M,reports,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,66740,1902\n
1,ABT\n,Abbott Laboratories,reports,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888\n
2,ABBV\n,AbbVie,reports,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)\n


In [265]:
#grab what's behind "reports" in column 3
#enumerate is a way to extract the index
sp_list = []
for tr in sp_trs[1:]:
    tds = tr.find_all('td')
    tr_list = []
    for (i,td) in enumerate(tds):
        if i == 2:
            tr_list.append(td.find('a')['href'])
        else:
            tr_list.append(td.text)
    sp_list.append(tr_list)
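
`td.find('a')` returns `None` when a cell has no link, and indexing `None` with `['href']` raises a `TypeError`. A defensive sketch on made-up rows (the URL is a placeholder) that falls back to the cell text in that case:

```python
from bs4 import BeautifulSoup

# Made-up rows: the second one has no link in its "reports" cell
html = """
<table>
<tr><td>MMM</td><td><a href="https://example.com/mmm">reports</a></td></tr>
<tr><td>XYZ</td><td>reports</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    row = []
    for i, td in enumerate(tr.find_all('td')):
        link = td.find('a')
        if i == 1 and link is not None:
            # Keep the link target instead of the visible text
            row.append(link.get('href'))
        else:
            row.append(td.get_text(strip=True))
    rows.append(row)

print(rows)
```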

In [266]:
sp_df = pd.DataFrame(sp_list)

In [267]:
sp_trs[1]

<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:MMM" rel="nofollow">MMM</a>
</td>
<td><a href="/wiki/3M" title="3M">3M</a></td>
<td><a class="external text" href="https://www.sec.gov/cgi-bin/browse-edgar?CIK=MMM&amp;action=getcompany" rel="nofollow">reports</a></td>
<td>Industrials</td>
<td>Industrial Conglomerates</td>
<td><a href="/wiki/Saint_Paul,_Minnesota" title="Saint Paul, Minnesota">Saint Paul, Minnesota</a></td>
<td>1976-08-09</td>
<td>0000066740</td>
<td>1902
</td></tr>

In [268]:
sp_df

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,MMM\n,3M,https://www.sec.gov/cgi-bin/browse-edgar?CIK=M...,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,0000066740,1902\n
1,ABT\n,Abbott Laboratories,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,0000001800,1888\n
2,ABBV\n,AbbVie,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,0001551152,2013 (1888)\n
3,ABMD\n,Abiomed,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,0000815094,1981\n
4,ACN\n,Accenture,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,0001467373,1989\n
...,...,...,...,...,...,...,...,...,...
500,YUM\n,Yum! Brands,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Y...,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,0001041061\n,1997\n
501,ZBRA\n,Zebra Technologies,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,0000877212\n,1969\n
502,ZBH\n,Zimmer Biomet,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,0001136869\n,1927\n
503,ZION\n,Zions Bancorp,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Financials,Regional Banks,"Salt Lake City, Utah",2001-06-22,0000109380\n,1873\n


In [280]:
sp_header=["Ticker", "Name", "Reports", "Sector", "Subsector", "Location", "Date Added", "Central Index Key", "Founded"]
sp_df = pd.DataFrame(sp_list, columns=sp_header)
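
The values still carry trailing `\n` characters and the dates are plain strings. One possible cleanup, sketched on a made-up two-row frame with the same column names:

```python
import pandas as pd

# Made-up rows mimicking the scraped values
df_demo = pd.DataFrame({
    'Ticker': ['MMM\n', 'ABT\n'],
    'Date Added': ['1976-08-09', '1964-03-31'],
    'Founded': ['1902\n', '1888\n'],
})

# .str.strip() removes leading and trailing whitespace, including newlines
for col in ['Ticker', 'Founded']:
    df_demo[col] = df_demo[col].str.strip()

# Convert date strings to datetimes; unparseable values become NaT
df_demo['Date Added'] = pd.to_datetime(df_demo['Date Added'], errors='coerce')

print(df_demo.dtypes)
```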

In [281]:
sp_df


Unnamed: 0,Ticker,Name,Reports,Sector,Subsector,Location,Date Added,Central Index Key,Founded
0,MMM\n,3M,https://www.sec.gov/cgi-bin/browse-edgar?CIK=M...,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,0000066740,1902\n
1,ABT\n,Abbott Laboratories,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,0000001800,1888\n
2,ABBV\n,AbbVie,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,0001551152,2013 (1888)\n
3,ABMD\n,Abiomed,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,0000815094,1981\n
4,ACN\n,Accenture,https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,0001467373,1989\n
...,...,...,...,...,...,...,...,...,...
500,YUM\n,Yum! Brands,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Y...,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,0001041061\n,1997\n
501,ZBRA\n,Zebra Technologies,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,0000877212\n,1969\n
502,ZBH\n,Zimmer Biomet,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,0001136869\n,1927\n
503,ZION\n,Zions Bancorp,https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...,Financials,Regional Banks,"Salt Lake City, Utah",2001-06-22,0000109380\n,1873\n


In [282]:
sp_df.to_csv('formatted.2021-07-04.sp500.csv', index=False)
#what do I do with this?
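
One thing to do with the file is load it back in a later session. A sketch of that round trip using an in-memory buffer and made-up rows: by default `read_csv` reads the Central Index Key as a number and drops its leading zeros, so `dtype` is used to keep it as text.

```python
import io
import pandas as pd

# Made-up rows with zero-padded identifiers stored as strings
df_demo = pd.DataFrame({'Ticker': ['MMM', 'ABT'],
                        'Central Index Key': ['0000066740', '0000001800']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default read: the column is inferred as integers
buffer.seek(0)
as_numbers = pd.read_csv(buffer)

# Reading the column as str preserves the leading zeros
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'Central Index Key': str})

print(as_numbers['Central Index Key'].tolist())
print(as_text['Central Index Key'].tolist())
```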