In this notebook I read in an Excel file containing Twitter account details for the S&P 500 firms. I then clean the data and save three JSON files containing, respectively, the firms' stock tickers, official firm Twitter accounts, and CEO Twitter accounts. Other notebooks use those lists to download the Twitter data.

In [2]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [3]:
# Display options: http://pandas.pydata.org/pandas-docs/stable/options.html
# Show every column and allow up to 250 characters per cell
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)

In [4]:
cd C:\\Users\\Gregory\\S&P500

C:\Users\Gregory\S&P500


<br>Read in company data file with stock tickers and Twitter account names for firms and CEOs

In [24]:
df = pd.read_excel('S&P 500 & CEO twitter accounts_updated 8.27.2020.xls')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

# of columns: 9
# of observations: 508


Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub Industry,Headquarters Location,CIK,Company Twitter account,CEO name,CEO Twitter account
0,MMM,3M Company,Industrials,Industrial Conglomerates,"St. Paul, Minnesota",66740.0,https://twitter.com/3M,Mike Roman,N/F
1,ABT,Abbott Laboratories,Health Care,Health Care Equipment,"North Chicago, Illinois",1800.0,https://twitter.com/AbbottNews,Robert Ford,N/F


### Edit firm and CEO Twitter account values
Replace 'N/F' values in *Company Twitter account* and *CEO Twitter account* with empty strings

In [29]:
df['Company Twitter account'] = np.where(df['Company Twitter account']=='N/F', '', df['Company Twitter account'])
df['CEO Twitter account'] = np.where(df['CEO Twitter account']=='N/F', '', df['CEO Twitter account'])
df[25:30]

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub Industry,Headquarters Location,CIK,Company Twitter account,CEO name,CEO Twitter account
25,MO,Altria Group Inc,Consumer Staples,Tobacco,"Richmond, Virginia",764180.0,https://twitter.com/AltriaNews,Billy Gifford,
26,AMZN,Amazon.com Inc.,Consumer Discretionary,Internet & Direct Marketing Retail,"Seattle, Washington",1018724.0,https://twitter.com/amazon,Jeff Bezos,https://twitter.com/JeffBezos
27,AMCR,Amcor plc,Materials,Paper Packaging,"Warmley, Bristol, United Kingdom",1748790.0,,Ronald Stephen Delia,
28,AEE,Ameren Corp,Utilities,Multi-Utilities,"St. Louis, Missouri",1002910.0,https://twitter.com/AmerenCorp,Warner Baxter,
29,AAL,American Airlines Group,Industrials,Airlines,"Fort Worth, Texas",6201.0,https://twitter.com/AmericanAir,Doug Parker,
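
The same replacement can also be written with pandas' own `replace` method, applied to both columns at once. The following is only a minimal sketch on a toy DataFrame: the tickers and account names in it are made up for illustration and are not rows from the spreadsheet.

```python
import pandas as pd

# Toy frame standing in for the spreadsheet (illustrative values only).
toy = pd.DataFrame({
    'Symbol': ['AAA', 'BBB', 'CCC'],
    'Company Twitter account': ['https://twitter.com/aaa', 'N/F',
                                'https://twitter.com/ccc'],
    'CEO Twitter account': ['N/F', 'N/F', 'https://twitter.com/ccc_ceo'],
})

account_cols = ['Company Twitter account', 'CEO Twitter account']

# Swap the 'N/F' placeholder for an empty string in both columns.
toy[account_cols] = toy[account_cols].replace('N/F', '')

print(toy[account_cols])
```

Note that this, like the `np.where` version above, leaves genuinely missing cells (NaN) untouched, so they still need to be handled before the lists are saved.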


### Save tickers list

In [30]:
tickers = df['Symbol'].tolist()
print(len(tickers), len(set(tickers)))
tickers[:5]

508 508


['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']

In [31]:
import json 
with open('sp500_tickers.json', 'w') as fp:
    json.dump(tickers, fp)

### Edit and save firm Twitter account list
Use *set* to remove duplicates

In [33]:
firms = df['Company Twitter account'].tolist()
print(len(firms), len(set(firms)))
firms = list(set(firms))
print(len(firms))
firms[:5]

508 458
458


['',
 'https://twitter.com/skyworksinc',
 'https://twitter.com/DominionEnergy',
 'https://twitter.com/Cisco',
 'https://twitter.com/Huntington_Bank']
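
One caveat with `set`: it does not keep the original spreadsheet order, and the order of strings in a set can differ from one Python session to the next, so the saved JSON file may not be identical between runs. If a stable order matters, one alternative (a sketch, not what this notebook does) is to de-duplicate with `dict.fromkeys`, which keeps the first occurrence of each value in its original position. The short input list here is illustrative.

```python
# Illustrative input with a repeated account and repeated blanks.
accounts = [
    'https://twitter.com/Cisco',
    '',
    'https://twitter.com/3M',
    'https://twitter.com/Cisco',
    '',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without shuffling the list.
unique_accounts = list(dict.fromkeys(accounts))

print(len(accounts), len(unique_accounts))
print(unique_accounts)
```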

<br>Check whether the numeric missing value (NaN) is in the firm list

In [43]:
np.nan in firms

False

<br>Remove the empty-string value ('') from the list

In [36]:
firms.remove('')
print(len(firms))
firms[:5]

457


['https://twitter.com/skyworksinc',
 'https://twitter.com/DominionEnergy',
 'https://twitter.com/Cisco',
 'https://twitter.com/Huntington_Bank',
 'https://twitter.com/alphabetlnc']

<br>Remove 'https://twitter.com/' from each account name

In [37]:
firms = [f.replace('https://twitter.com/', '') for f in firms]
print(len(firms), len(set(firms)))
firms = list(set(firms))
print(len(firms))
firms[:5]

457 457
457


['FISGlobal', 'CharlesSchwab', 'generalelectric', 'PNCBank', 'comcast']
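
The plain `replace` works here because every value uses exactly the `https://twitter.com/` prefix. If the spreadsheet ever contained variants (a trailing slash, a `www.` host, or a query string), a slightly more defensive approach would be to parse the URL and keep only the first path segment. The helper below is a hypothetical sketch; `handle_from_url` is a name introduced for illustration, and the variant URLs are assumed examples rather than values seen in the data.

```python
from urllib.parse import urlparse


def handle_from_url(url):
    """Return the account handle from a Twitter profile URL (hypothetical helper)."""
    # urlparse separates scheme, host, path and query, so the host
    # ('twitter.com' vs 'www.twitter.com') and any '?lang=en' are ignored.
    path = urlparse(url.strip()).path
    # Drop leading/trailing slashes and keep the first path segment.
    return path.strip('/').split('/')[0]


examples = [
    'https://twitter.com/AbbottNews',
    'https://twitter.com/AbbottNews/',
    'http://www.twitter.com/3M?lang=en',
]
print([handle_from_url(u) for u in examples])
```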

<br>Save list in JSON format

In [52]:
import json 
with open('sp500_firms.json', 'w') as fp:
    json.dump(firms, fp)

### Edit and save CEO Twitter account list

In [48]:
ceos = df['CEO Twitter account'].tolist()
print(len(ceos), len(set(ceos)))
ceos = list(set(ceos))
print(len(ceos))
ceos[:5]

508 77
77


['',
 nan,
 'https://twitter.com/MikeSievert',
 'https://twitter.com/MattMaddox_',
 'https://twitter.com/PatrickKDecker']

In [49]:
np.nan in ceos

True

In [50]:
ceos.remove('')
ceos.remove(np.nan)
print(len(ceos))
ceos[:5]

75


['https://twitter.com/MikeSievert',
 'https://twitter.com/MattMaddox_',
 'https://twitter.com/PatrickKDecker',
 'https://twitter.com/Corie_Barry',
 'https://twitter.com/stevemollenkopf']
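
A note on the NaN handling above: `np.nan in ceos` and `ceos.remove(np.nan)` work here because Python's `in` and `list.remove` check object identity before equality, and the missing value in the list happens to be the very same object as `np.nan`. NaN never compares equal to itself, so a NaN that is a different float object would be missed. A more robust pattern, sketched below on an illustrative list, is to test each element with `pd.isna` and filter blanks and NaNs in a single pass.

```python
import numpy as np
import pandas as pd

# Illustrative list: a blank, two different NaN objects, and one real URL.
values = ['', float('nan'), 'https://twitter.com/JeffBezos', np.nan]

# pd.isna is True for any NaN, whichever object it is.
has_missing = any(pd.isna(v) for v in values)

# Keep only entries that are neither NaN nor empty strings.
cleaned = [v for v in values if not pd.isna(v) and v != '']

print(has_missing)
print(cleaned)
```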

In [51]:
ceos = [c.replace('https://twitter.com/', '') for c in ceos]
print(len(ceos), len(set(ceos)))
ceos = list(set(ceos))
print(len(ceos))
ceos[:5]

75 75
75


['micronceo', 'Corie_Barry', 'gary_kelly', 'KenXieFortinet', 'ThomasAFanning']

In [53]:
import json 
with open('sp500_ceos.json', 'w') as fp:
    json.dump(ceos, fp)
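
For reference, here is a minimal sketch of how a list saved this way can be read back, which is roughly what the downstream notebooks would need to do. The helper names `save_list` and `load_list`, the example file name, and the temporary directory are assumptions made for illustration so that the sketch does not overwrite the real output files.

```python
import json
import os
import tempfile


def save_list(items, path):
    """Write a list to a JSON file (hypothetical helper)."""
    with open(path, 'w') as fp:
        json.dump(items, fp)


def load_list(path):
    """Read a list back from a JSON file (hypothetical helper)."""
    with open(path) as fp:
        return json.load(fp)


# Round trip in a temporary directory, using two handles from the CEO list.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, 'sp500_ceos_example.json')

save_list(['micronceo', 'Corie_Barry'], example_path)
reloaded = load_list(example_path)
print(reloaded)
```

A JSON array loads back as an ordinary Python list of strings, so no further conversion is needed before looping over the accounts.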