# Data module class 2
Reading documentation: Pandas and BeautifulSoup

In [156]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [157]:
# install BeautifulSoup (and lxml, the parser used below) if you need to
# !pip install beautifulsoup4 lxml

## Pandas
### Terminology reference
#### Data structures
##### 1-dimensional data (create Series)

|pandas abbreviation|definition|example|
|---|---|---|
|dict|Python dictionary|`{'a': 'value', 'b': 'value'}`|
|ndarray|NumPy N-dimensional array (1-dimensional when creating a Series)|`np.array([0, 1, 2, 3])`|
|scalar|Single value|`100`|
|list|Python list|`[0, 1, 2, 3]`|

##### 2-dimensional data (create DataFrames)

|pandas term|example|
|---|---|
|ndarray|`np.array([[0, 1, 2, 3], [4, 5, 6, 7]])`|
|dict of ndarrays (or lists)|`{'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}`|
|list of dicts|`[{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}]`|

#### How do these look when loaded in pandas?
[Taken from the Pandas User Guide](https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [158]:
pd.Series({'a': 'value', 'b': 'value'})

a    value
b    value
dtype: object

In [159]:
pd.Series([0, 1, 2, 3])

0    0
1    1
2    2
3    3
dtype: int64

In [160]:
pd.Series(5)

0    5
dtype: int64

In [161]:
pd.DataFrame([{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}])

Unnamed: 0,id,info
0,1,text
1,2,more text
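
The cells above cover the Series inputs and the list-of-dicts case. The other two DataFrame inputs from the table can be sketched the same way. This is a minimal illustration; the variable names `from_array` and `from_dict` are made up for this example.

```python
import numpy as np
import pandas as pd

# 2-dimensional ndarray: each inner list becomes a row,
# and pandas assigns default integer column labels
from_array = pd.DataFrame(np.array([[0, 1, 2, 3], [4, 5, 6, 7]]))

# dict of lists: each key becomes a column name,
# and each list becomes that column's values
from_dict = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})

print(from_array.shape)  # rows, columns
print(from_dict.shape)
```

Note how the two inputs are oriented differently: the ndarray is read row by row, while the dict is read column by column.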


#### Other terms
[See pd.to_datetime() as an example](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime)

#### Parameters: information that a function accepts
- args
    - Required arguments: the values a function needs in order to run
    - e.g. the data for your DataFrame
- kwargs (even though pandas does not label them as such)
    - Keyword arguments: optional arguments that have a default value. The function runs without them, but passing one changes its default behavior. They are called "keyword" arguments because you pass them by name
    - e.g. `errors='raise'`
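
To make the difference concrete, here is a small sketch using `pd.to_datetime()`. The list `dates` is invented sample data. The first call passes only the required argument, so the default `errors='raise'` applies; the second passes the keyword argument `errors='coerce'` to change that behavior.

```python
import pandas as pd

# made-up sample data: one valid date and one value that cannot be parsed
dates = ['2021-07-01', 'not a date']

# only the required argument: the default errors='raise' applies,
# so the unparseable value raises an error
try:
    pd.to_datetime(dates)
    raised = False
except ValueError:
    raised = True

# same required argument plus a keyword argument:
# unparseable values become NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')

print(raised)
print(parsed)
```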

### 1. Let's practice input/output with Pandas with the following links.
Use the [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) section of the pandas documentation to load these datasets

- [Avengers Wikia data - FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv) | [Documentation here](https://github.com/fivethirtyeight/data/tree/master/avengers)
- [List of sovereign states - Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states)
- [Homeless housing - LA Times](https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv) | [Documentation](https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal)

In [183]:
countries = pd.read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')
avengers = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')
homeless = pd.read_csv('https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv')
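
Note that `pd.read_csv()` returns a single DataFrame, while `pd.read_html()` returns a list of DataFrames (one per table found), which is why the countries table is later selected with `countries[0]`. If you want to try `read_csv` without a network connection, it also accepts a file-like object. The sketch below uses a tiny made-up CSV string (`csv_text`), not the real dataset.

```python
import io
import pandas as pd

# made-up CSV text standing in for a downloaded file;
# the second data row leaves two fields empty
csv_text = (
    "name,ALIVE,APPEARANCES\n"
    "Spider-Man,Living Characters,4043\n"
    "Unknown,,\n"
)

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# empty fields are loaded as NaN
print(sample)
print(sample['APPEARANCES'].isna().sum())
```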

### 2. Let's practice working with missing data and selecting these values
#### For each DataFrame, either select all the missing values of one column or select a unique categorical value.
The [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) section of the pandas documentation will help.
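
Both kinds of selection rely on a boolean mask: a Series of `True`/`False` values placed inside the square brackets. Here is a minimal sketch on a made-up three-row DataFrame (`toy`) that borrows two column names from the Avengers data.

```python
import numpy as np
import pandas as pd

# made-up data with one missing value in each of two columns
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'ALIVE': ['Living Characters', np.nan, 'Deceased Characters'],
    'APPEARANCES': [10.0, 5.0, np.nan],
})

# rows where one specific column is missing
missing_appearances = toy[toy['APPEARANCES'].isna()]

# rows where any column is missing
any_missing = toy[toy.isna().any(axis=1)]

# rows matching one categorical value
living = toy[toy['ALIVE'] == 'Living Characters']

print(missing_appearances)
print(any_missing)
print(living)
```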

#### a. Avengers

In [163]:
avengers.head(10)

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
0,1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
1,7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
2,64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
3,1868,"Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
4,2460,Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0
5,2458,Benjamin Grimm (Earth-616),\/Benjamin_Grimm_(Earth-616),Public Identity,Good Characters,Blue Eyes,No Hair,Male Characters,,Living Characters,2255.0,Nov-61,1961.0
6,2166,Reed Richards (Earth-616),\/Reed_Richards_(Earth-616),Public Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,2072.0,Nov-61,1961.0
7,1833,Hulk (Robert Bruce Banner),\/Hulk_(Robert_Bruce_Banner),Public Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,2017.0,May-62,1962.0
8,29481,Scott Summers (Earth-616),\/Scott_Summers_(Earth-616),Public Identity,Neutral Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1955.0,Sep-63,1963.0
9,1837,Jonathan Storm (Earth-616),\/Jonathan_Storm_(Earth-616),Public Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,1934.0,Nov-61,1961.0


In [164]:
avengers.ALIVE.unique()

array(['Living Characters', 'Deceased Characters', nan], dtype=object)

In [165]:
avengers[avengers['APPEARANCES'].isna()]

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
15280,743309,Minister of Castile D'or (Earth-616),\/Minister_of_Castile_D%27or_(Earth-616),No Dual Identity,Neutral Characters,,,Male Characters,,Deceased Characters,,Dec-39,1939.0
15281,645438,Mr. Harris' Secretary (Earth-616),\/Mr._Harris%27_Secretary_(Earth-616),No Dual Identity,Neutral Characters,,Blond Hair,Female Characters,,Living Characters,,Oct-39,1939.0
15282,331151,N'Jaga (Earth-616),\/N%27Jaga_(Earth-616),No Dual Identity,Bad Characters,,,Male Characters,,Living Characters,,Oct-39,1939.0
15283,505986,Ertve (Earth-616),\/Ertve_(Earth-616),Secret Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,,Feb-40,1940.0
15284,19657,Invisible Man (Gade) (Earth-616),\/Invisible_Man_(Gade)_(Earth-616),Secret Identity,Good Characters,,,Male Characters,,Living Characters,,Apr-40,1940.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16371,657508,Ru'ach (Earth-616),\/Ru%27ach_(Earth-616),No Dual Identity,Bad Characters,Green Eyes,No Hair,Male Characters,,Living Characters,,,
16372,665474,Thane (Thanos' son) (Earth-616),\/Thane_(Thanos%27_son)_(Earth-616),No Dual Identity,Good Characters,Blue Eyes,Bald,Male Characters,,Living Characters,,,
16373,695217,Tinkerer (Skrull) (Earth-616),\/Tinkerer_(Skrull)_(Earth-616),Secret Identity,Bad Characters,Black Eyes,Bald,Male Characters,,Living Characters,,,
16374,708811,TK421 (Spiderling) (Earth-616),\/TK421_(Spiderling)_(Earth-616),Secret Identity,Neutral Characters,,,Male Characters,,Living Characters,,,


In [166]:
# show any row with nan
avengers[avengers.isnull().any(axis=1)]


Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
0,1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
1,7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
2,64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
3,1868,"Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
4,2460,Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16371,657508,Ru'ach (Earth-616),\/Ru%27ach_(Earth-616),No Dual Identity,Bad Characters,Green Eyes,No Hair,Male Characters,,Living Characters,,,
16372,665474,Thane (Thanos' son) (Earth-616),\/Thane_(Thanos%27_son)_(Earth-616),No Dual Identity,Good Characters,Blue Eyes,Bald,Male Characters,,Living Characters,,,
16373,695217,Tinkerer (Skrull) (Earth-616),\/Tinkerer_(Skrull)_(Earth-616),Secret Identity,Bad Characters,Black Eyes,Bald,Male Characters,,Living Characters,,,
16374,708811,TK421 (Spiderling) (Earth-616),\/TK421_(Spiderling)_(Earth-616),Secret Identity,Neutral Characters,,,Male Characters,,Living Characters,,,


#### b. Countries

In [167]:
df_countries = countries[0]

df_countries

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
1,UN member states and observer states ↓,,,
2,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing,Abkhazia → See Abkhazia listing
3,Afghanistan – Islamic Republic of Afghanistan,UN member state,,
4,Albania – Republic of Albania,,,
...,...,...,...,...
237,South Ossetia – Republic of South Ossetia–the ...,,Georgia,"A de facto independent state,[70] recognised b..."
238,Taiwan – Republic of China[l],Former UN member and former permanent UN Secur...,People's Republic of China,A state competing (nominally) for recognition ...
239,Transnistria – Pridnestrovian Moldavian Republic,,Moldova,"A de facto independent state,[56] recognised o..."
240,,,,


In [168]:
df_countries[df_countries['Common and formal names'].isnull()]

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
227,,,,
228,,,,
240,,,,
241,,,,


In [169]:
df_countries['Sovereignty dispute[b]'].unique()

array([nan, 'Abkhazia → See Abkhazia listing',
       'Not recognised by Pakistan.', 'Artsakh → See Artsakh listing',
       'Burma → See Myanmar listing',
       'Partially unrecognised. Republic of China',
       'China, Republic of → See Taiwan listing',
       'Cook Islands → See Cook Islands listing',
       "Côte d'Ivoire → See Ivory Coast listing",
       'Not recognised by Turkey[13]',
       "Democratic People's Republic of Korea → See Korea, North listing",
       'Democratic Republic of the Congo → See Congo, Democratic Republic of the listing',
       'Holy See → See Vatican City listing', 'Partially unrecognised',
       'South Korea', 'North Korea', 'Kosovo → See Kosovo listing',
       'Macedonia → See North Macedonia listing',
       'Nagorno-Karabakh → See Artsakh listing',
       'Niue → See Niue listing',
       'North Korea → See Korea, North listing',
       'Northern Cyprus → See Northern Cyprus listing',
       'Partially unrecognised. Israel',
       'Pridnestro

#### c. LA homeless housing

In [170]:
homeless_nan = homeless.isnull()
homeless_nan = homeless_nan.any(axis=1)
homeless[homeless_nan]


Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress


In [171]:
homeless.status.unique()

array(['Already approved', 'Pending City Council approval'], dtype=object)

### 3. Let's practice cleaning with intent

#### Use each of the three datasets loaded above to come up with a question you want to answer with the data
##### Tips
- Show the column list, the column types, and the null counts
- Find unique values to look at categorical data

#### a. Avengers
##### Question
How many characters of each gender are there per year?

##### What steps do I need to do to answer the question?
- Group by Year
- Count by gender
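
The two steps above can be sketched on a made-up five-row DataFrame (`toy`) before running them on the full data. One thing worth noticing: `groupby` drops rows whose key is missing by default, so characters without a `Year` do not show up in the counts.

```python
import pandas as pd

# made-up data; the last row has no Year
toy = pd.DataFrame({
    'Year': [1939.0, 1939.0, 1939.0, 1940.0, None],
    'SEX': ['Male Characters', 'Male Characters', 'Female Characters',
            'Female Characters', 'Male Characters'],
})

# group by both columns, count the rows in each group,
# then turn the result back into a regular DataFrame
counts = toy.groupby(['Year', 'SEX']).size().reset_index(name='counts')

print(counts)
```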

In [172]:
# show the dataframe info here to get you started 
avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   page_id           16376 non-null  int64  
 1   name              16376 non-null  object 
 2   urlslug           16376 non-null  object 
 3   ID                12606 non-null  object 
 4   ALIGN             13564 non-null  object 
 5   EYE               6609 non-null   object 
 6   HAIR              12112 non-null  object 
 7   SEX               15522 non-null  object 
 8   GSM               90 non-null     object 
 9   ALIVE             16373 non-null  object 
 10  APPEARANCES       15280 non-null  float64
 11  FIRST APPEARANCE  15561 non-null  object 
 12  Year              15561 non-null  float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB


In [173]:
rank = avengers.groupby(by=['Year','SEX']).size().reset_index(name='counts')

rank

Unnamed: 0,Year,SEX,counts
0,1939.0,Female Characters,10
1,1939.0,Male Characters,56
2,1940.0,Female Characters,33
3,1940.0,Male Characters,183
4,1941.0,Female Characters,15
...,...,...,...
170,2012.0,Female Characters,56
171,2012.0,Male Characters,140
172,2013.0,Agender Characters,5
173,2013.0,Female Characters,54


In [174]:
import plotly.express as px

px.line(rank, x="Year", y="counts", title='Characters gender by year',color='SEX')

In [194]:
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of sovereign states - Wikipedia</title>
...

#### b. Countries
##### Question
- Which countries are involved in the most sovereignty disputes?

*(Still to be finished)*
##### What cleaning do I need to do to answer the question?
- 
- 
- 

In [193]:
# pd.read_html(result.content)[0].tail(20)

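from io import StringIO

result = requests.get('https://en.wikipedia.org/wiki/List_of_sovereign_states')
soup = BeautifulSoup(result.content, 'lxml')
tables = soup.find_all('table')
# wrap the HTML string in StringIO: newer pandas versions deprecate passing literal HTML to read_html
pd.read_html(StringIO(str(tables[0])))[0].tail(20)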

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
222,Vietnam – Socialist Republic of Vietnam,,,
223,Yemen – Republic of Yemen,,,
224,Zambia – Republic of Zambia,,,
225,Zimbabwe – Republic of Zimbabwe,,,
226,UN member states and observer states ↑,,,
227,,,,
228,,,,
229,Abkhazia – Republic of Abkhazia,,Georgia,"Recognised by Russia, Nauru, Nicaragua, Syria,..."
230,Artsakh – Republic of Artsakh[ag],,Azerbaijan,"A de facto independent state,[56][57][58] reco..."
231,Cook Islands,UN specialized agencies,(See political status),"A state in free association with New Zealand, ..."


In [190]:
countries = pd.read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')

countries[0].tail(5)

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
237,South Ossetia – Republic of South Ossetia–the ...,,Georgia,"A de facto independent state,[70] recognised b..."
238,Taiwan – Republic of China[l],Former UN member and former permanent UN Secur...,People's Republic of China,A state competing (nominally) for recognition ...
239,Transnistria – Pridnestrovian Moldavian Republic,,Moldova,"A de facto independent state,[56] recognised o..."
240,,,,
241,,,,


In [153]:
df_countries.columns = ['name','un','dispute','info']

df_countries

Unnamed: 0,name,un,dispute,info
3,Afghanistan – Islamic Republic of Afghanistan,UN member state,,
4,Albania – Republic of Albania,,,
5,Algeria – People's Democratic Republic of Algeria,,,
6,Andorra – Principality of Andorra,,,Andorra is a co-principality in which the offi...
7,Angola – Republic of Angola,,,
...,...,...,...,...
234,Northern Cyprus – Turkish Republic of Northern...,,Republic of Cyprus,"Recognised only by Turkey. Under the name ""Tur..."
235,Sahrawi Arab Democratic Republic,,Morocco,Recognised at some stage by 84 UN member state...
236,Somaliland – Republic of Somaliland,,Somalia,"A de facto independent state,[56][65][66][67][..."
237,South Ossetia – Republic of South Ossetia–the ...,,Georgia,"A de facto independent state,[70] recognised b..."


In [155]:
df_countries['dispute'].value_counts()

Georgia                                                                             2
(See political status)                                                              2
Serbia                                                                              1
Azerbaijan                                                                          1
The Bahamas → See Bahamas, The listing                                              1
Not recognised by Pakistan.                                                         1
Burma → See Myanmar listing                                                         1
The Gambia → See Gambia, The listing                                                1
Northern Cyprus → See Northern Cyprus listing                                       1
Sudan, South → See South Sudan listing                                              1
Partially unrecognised                                                              1
Taiwan (Republic of China) → See Taiwan listing       
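
The counts above still include redirect rows (the ones containing `→`), so they are not yet a clean answer. One possible cleaning step, sketched here on made-up rows that imitate the table's structure, is to drop empty rows and redirect rows before counting. This is an assumption about how the cleaning could go, not the notebook's finished approach.

```python
import pandas as pd

# made-up rows imitating the renamed countries table
toy = pd.DataFrame({
    'name': [
        'Abkhazia → See Abkhazia listing',
        'Abkhazia – Republic of Abkhazia',
        'South Ossetia – Republic of South Ossetia',
        'Transnistria – Pridnestrovian Moldavian Republic',
        None,
    ],
    'dispute': [
        'Abkhazia → See Abkhazia listing',
        'Georgia',
        'Georgia',
        'Moldova',
        None,
    ],
})

# redirect rows contain an arrow; na=False treats missing names as non-matches
is_redirect = toy['name'].str.contains('→', na=False)

# keep rows that have a name and are not redirects
cleaned = toy[toy['name'].notna() & ~is_redirect]

# count how often each country appears as the disputing party
dispute_counts = cleaned['dispute'].value_counts()

print(dispute_counts)
```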

#### c. LA homeless housing
##### Question
- What percentage of projects falls under each status?

##### What cleaning do I need to do to answer the question?
- 
- 
- 

In [202]:
homeless

Unnamed: 0,project_name,address,district_no,units,sh_units,status,lon,lat,geoAddress
0,Reseda Theater Senior Housing (Canby Woods West),7221 N CANBY AVE CA 91335,3,26,13,Already approved,-118.535105,34.201798,"7221 canby ave, reseda, ca 91335, usa"
1,Main Street Apartments,5501 S MAIN ST CA 90037,9,57,56,Already approved,-118.274276,33.992203,"5501 s main st, los angeles, ca 90037, usa"
2,Berendo Sage,1035 S BERENDO ST CA 90006,1,42,21,Already approved,-118.294014,34.051678,"1035 s berendo st, los angeles, ca 90006, usa"
3,South Main Street Apartments,12003 S MAIN ST CA 90061,15,62,61,Already approved,-118.274250,33.923439,"12003 s main st, los angeles, ca 90061, usa"
4,Montecito II Senior Housing,6668 W FRANKLIN AVE HOLLYWOOD CA 90028,13,64,32,Already approved,-118.335282,34.105027,"6668 franklin ave, los angeles, ca 90028, usa"
...,...,...,...,...,...,...,...,...,...
74,4719 Normandie,4719 S NORMANDIE AVE 90037,8,48,47,Pending City Council approval,-118.300502,34.000387,"4719 normandie ave, los angeles, ca 90037, usa"
75,Amani Apartments (PICO),4200 W PICO BLVD 90019,10,55,54,Pending City Council approval,-118.327182,34.047553,"4200 pico blvd, los angeles, ca 90019, usa"
76,Mariposa Lily,1055 S MARIPOSA AVE 90006,1,41,20,Pending City Council approval,-118.299164,34.051089,"1055 s mariposa ave, los angeles, ca 90006, usa"
77,410 E. Florence Avenue,410 E. Florence Ave. 90003,9,51,50,Pending City Council approval,-118.267063,33.974401,"410 e florence ave, los angeles, ca 90003, usa"


In [201]:
homeless['status'].value_counts(normalize=True)

Already approved                 0.696203
Pending City Council approval    0.303797
Name: status, dtype: float64
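
`normalize=True` returns proportions between 0 and 1. To read them as percentages, multiply by 100 and round. A small sketch on a made-up `status` Series (not the real project list):

```python
import pandas as pd

# made-up statuses: three approved projects and one pending
status = pd.Series(
    ['Already approved'] * 3 + ['Pending City Council approval'],
    name='status',
)

# proportions -> percentages, rounded to one decimal place
pct = status.value_counts(normalize=True).mul(100).round(1)

print(pct)
```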

Take a look at the [LA Times'](https://github.com/datadesk/notebooks) or [FiveThirtyEight's](https://github.com/fivethirtyeight/data) repositories for more practice.

## BeautifulSoup
[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [213]:
# load in the HTML and format for BS
sp_wiki_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

html = requests.get(sp_wiki_url).content

# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'lxml')

soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of S&amp;P 500 companies - Wikipedia</title>
...

In [212]:
# find the title tag

soup.title

<title>List of S&amp;P 500 companies - Wikipedia</title>

In [214]:
# grab the first a tag

soup.a

<a id="top"></a>

In [216]:
# finds all a tags

soup.find_all('a')

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/S%26P_500" title="S&amp;P 500">S&amp;P 500</a>,
 <a href="/wiki/Stock_market_index" title="Stock market index">stock market index</a>,
 <a href="/wiki/S%26P_Dow_Jones_Indices" title="S&amp;P Dow Jones Indices">S&amp;P Dow Jones Indices</a>,
 <a href="/wiki/Common_stock" title="Common stock">common stocks</a>,
 <a href="/wiki/Market_capitalization" title="Market capitalization">large-cap</a>,
 <a href="/wiki/Dow_Jones_Industrial_Average" title="Dow Jones Industrial Average">Dow Jones Industrial Average</a>,
 <a href="#cite_note-1">[1]</a>,
 <a href="#cite_note-2">[2]</a>,
 <a href="#S&amp;P_500_component_stocks"><span class="tocnumber">1</span> <span class="toctext">S&amp;P 500 component stocks</span></a>,
 <a href="#Selected_changes_to_the_list_of_S&amp;P_500_components"><span class="tocnumber">2</span> <span class="toctext

In [219]:
# find all elements with the class "mw-jump-link"

soup.find_all(class_ = 'mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]
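
The same calls can be tried offline on a small HTML string. The markup below is made up for illustration and loosely imitates the tags seen above; `html.parser` is Python's built-in parser, so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# made-up HTML imitating a few tags from the page above
html_demo = """
<html><head><title>Demo page</title></head>
<body>
<a id="top"></a>
<a class="mw-jump-link" href="#nav">Jump to navigation</a>
<a class="mw-jump-link" href="#search">Jump to search</a>
<a href="/wiki/Example">Example</a>
</body></html>
"""

soup_demo = BeautifulSoup(html_demo, 'html.parser')

# attribute access returns the first matching tag
title_text = soup_demo.title.get_text()
first_link = soup_demo.a

# find_all returns every match as a list
all_links = soup_demo.find_all('a')
jump_links = soup_demo.find_all(class_='mw-jump-link')

# .get() reads an attribute and returns None when the tag does not have it
hrefs = [a.get('href') for a in all_links]

print(title_text)
print(hrefs)
```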

#### Format the first table of the list of S&P 500 companies wiki page as a dataframe

[Traversing the DOM - W3C](https://www.w3.org/wiki/Traversing_the_DOM)

In [238]:
# pd.read_html returns a list of every table whose text matches 'CIK'
tables = pd.read_html(sp_wiki_url, match='CIK')

# each item in the list is already a DataFrame
tables[0]

Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM,3M,reports,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,66740,1902
1,ABT,Abbott Laboratories,reports,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888
2,ABBV,AbbVie,reports,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
3,ABMD,Abiomed,reports,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,815094,1981
4,ACN,Accenture,reports,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
...,...,...,...,...,...,...,...,...,...
500,YUM,Yum! Brands,reports,Consumer Discretionary,Restaurants,"Louisville, Kentucky",1997-10-06,1041061,1997
501,ZBRA,Zebra Technologies,reports,Information Technology,Electronic Equipment & Instruments,"Lincolnshire, Illinois",2019-12-23,877212,1969
502,ZBH,Zimmer Biomet,reports,Health Care,Health Care Equipment,"Warsaw, Indiana",2001-08-07,1136869,1927
503,ZION,Zions Bancorp,reports,Financials,Regional Banks,"Salt Lake City, Utah",2001-06-22,109380,1873


### We can do more cleaning here
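
As one example of further cleaning, the `Founded` column mixes plain years with values like `2013 (1888)`, and `Date first added` may load as text. The sketch below works on two rows copied from the output above; keeping only the first four-digit year in `Founded` is an assumption about what you want, so adjust it to your question.

```python
import pandas as pd

# two rows copied from the table above
sp_sample = pd.DataFrame({
    'Symbol': ['MMM', 'ABBV'],
    'Date first added': ['1976-08-09', '2012-12-31'],
    'Founded': ['1902', '2013 (1888)'],
})

# convert the date strings into real datetimes
sp_sample['Date first added'] = pd.to_datetime(sp_sample['Date first added'])

# keep the first four-digit year and convert it to an integer
sp_sample['Founded'] = (
    sp_sample['Founded'].str.extract(r'(\d{4})', expand=False).astype(int)
)

print(sp_sample.dtypes)
print(sp_sample)
```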