<h1>Pandas</h1>

<li>Integrated data manipulation and analysis capabilities
<li>Integration with data visualization libraries
<li>Integration with machine learning libraries
<li>Built-in time-series capabilities (Pandas was originally designed for financial time series data)
<li>Optimized for speed
<li>Built-in support for reading data from multiple sources: CSV, XLS, HTML tables, Yahoo, Google, World Bank, FRED
<li>Integrated data manipulation support (messy data, missing data, feature construction)
<li><b>End to end support for data manipulation, data visualization, data analysis, and presenting results</b>

<h2>Types of data in data analysis</h2>
<li><b>Categorical</b>: data with a fixed (finite) set of values that are not necessarily ordered
<ul>
<li>gender, marital status, income level, semester grade etc.
<li>Pandas uses the <b>Categorical</b> data type to represent this type of data
</ul>
<li><b>Continuous</b>: data drawn from an infinite set of ordered values, with infinitely many possible values between any two data elements
<ul>
<li>stock prices, sales revenue, dollar income, etc.
<li>Pandas uses the numpy <b>float</b> and <b>int</b> types to represent this type of data
</ul>
<li><b>Discrete</b>: numerical data that cannot be subdivided further
<ul>
<li>Counts of categorical data
<li>Number of males, number of people in an income level, etc.
<li>Pandas uses the numpy <b>int</b> to represent this type of data
</ul>
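
The three kinds of data can be seen in the dtypes pandas assigns. The following is a minimal sketch; the sample values and the variable names (grades, prices, counts) are invented for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
grades = pd.Series(['A', 'B', 'A', 'C'], dtype='category')  # categorical
prices = pd.Series([217.2, 218.7, 216.5])                   # continuous
counts = grades.value_counts()                              # discrete counts per category

print(grades.dtype)                  # category
print(prices.dtype)                  # float64
print(counts.dtype)                  # an integer dtype
print(list(grades.cat.categories))   # ['A', 'B', 'C']
```

Counting a categorical column is a common way to turn categorical data into discrete data.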

<h3>Pandas organizes data into two data objects</h3>
<li>Series: A one dimensional array object
<li>DataFrame: A two dimensional table object
<ul>
<li>Each column in a dataframe corresponds to a named series

<li>Rows in a dataframe can be indexed by a column of any datatype
</ul>

In [2]:
import pandas as pd
import numpy as np

<h1>Series</h1>

In [3]:
x = pd.Series(np.random.randint(1000,size=1000))
x[:4]
#randint(1000, size=1000) draws 1000 random integers from 0 to 999:
#the first argument is the (exclusive) upper bound, size is the length of the series


0    771
1    228
2    639
3    971
dtype: int64

In [4]:
print(x.head())
print(x.tail())
#head() shows the first five rows, tail() the last five

0    771
1    228
2    639
3    971
4    192
dtype: int64
995    461
996    688
997    154
998    549
999    279
dtype: int64


<h3>Series are indexed</h3>
<li>Every series contains an index and the values of each index item
<li>Series items are accessed through the index
<li>Iterating over a series yields its values; iterate over the index to get the labels

In [5]:
x[0] #0 is the first location

771

In [6]:
for i in x.index:
    print(x[i])

771
228
639
971
192
...

In [7]:
#The i's in the following loop are values in x, not locations in x!
for i in x:
    print(x[i])

873
198
928
76
49
...

In [8]:
for i in x:
    print(i)
    #prints each stored value

771
228
639
971
192
...

<h3>Series and dict</h3>
<li> A dictionary will automatically be broken up into index and value pairs</li>
<li> In the following example, the index is ['a','b','c'] and the series contains [1,2,3]

In [9]:
x = {'a':1,'b':2,'c':3}
y=pd.Series(x)
print(y['b'])
#a series is looked up by its index labels (the dict keys), not by its values

#open question: is there a way to get the list of keys associated with a given value,
#i.e. to flip the key/value lookup?

2
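
One way to look up keys by value is boolean indexing. This is a sketch rather than the only approach; the extra key 'd' and the names keys_for_2 and flipped are made up to show what happens with a repeated value.

```python
import pandas as pd

y = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 2})  # 'd' added to show a repeated value

# The boolean mask keeps the rows whose value equals 2; .index gives their labels
keys_for_2 = y[y == 2].index.tolist()
print(keys_for_2)        # ['b', 'd']

# Swapping index and values gives a value -> label lookup
# (only unambiguous for values that occur once)
flipped = pd.Series(y.index, index=y.values)
print(flipped.loc[1])    # a
```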


<h3>Series objects work like numpy ndarrays</h3>
<li>but with an independent index attached to the values of the array
<li>the index can be of any immutable data type

In [10]:
nums = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
names = np.array(('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
months = pd.Series(nums,index=names)
months['Mar']
#the numbers are indexed by month name: looking up a month (the index label,
#i.e. the key) returns the associated number
#months


3

<h4>The index attribute returns the index associated with a series</h4>
<li> The data type of the index is the pandas <b>Index</b> type

In [11]:
months.index #the index attribute returns the index labels

Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct',
       'Nov', 'Dec'],
      dtype='object')

<h3>Accessing data by row number</h3>
<li>Series objects are considered to be "ordered"
<li>So we can also access objects by row number

In [12]:
months.iloc[1]
#positions start at 0, so position 1 holds the second value, 2
#(plain months[1] relies on a positional fallback that newer pandas versions deprecate)

2

<h4>And we can find the row number given an index value<br>
and then use that to access the data at that row</h4>

In [13]:
row = months.index.get_loc('Mar')
months.iloc[row]
#iloc ("integer location") selects by row position rather than by label


3

<h4>We can do numpy operations on a pandas series</h4>


<b>Scalar multiplication</b>

In [14]:
months*2 #changes the values; the associated index remains unchanged

Jan     2
Feb     4
Mar     6
Apr     8
May    10
Jun    12
Jul    14
Aug    16
Sep    18
Oct    20
Nov    22
Dec    24
dtype: int64

<b>Addition</b>

In [15]:
x=pd.Series([1,3,5,7,11])
z = pd.Series([1,2,3,4,5])
x+z

0     2
1     5
2     8
3    11
4    16
dtype: int64

<h4>provided the indexes match (values are aligned by index label, and unmatched labels give NaN)</h4>

In [16]:
nums = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
names = np.array(('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
months = pd.Series(nums,index=names)
things = ['partridge','turtle dove','french hen','calling birds','golden rings','geese','swans','milking maids',
                'dancing ladies','leaping lords','piping pipers','drumming drummers' ]
days_of_xmas = pd.Series(nums,things)
months + days_of_xmas

Apr                 NaN
Aug                 NaN
Dec                 NaN
Feb                 NaN
Jan                 NaN
Jul                 NaN
Jun                 NaN
Mar                 NaN
May                 NaN
Nov                 NaN
Oct                 NaN
Sep                 NaN
calling birds       NaN
dancing ladies      NaN
drumming drummers   NaN
french hen          NaN
geese               NaN
golden rings        NaN
leaping lords       NaN
milking maids       NaN
partridge           NaN
piping pipers       NaN
swans               NaN
turtle dove         NaN
dtype: float64
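
The alignment behaviour is easier to see on a small case where only some labels overlap. The sketch below uses invented series a and b; fill_value and .values are two possible ways around the NaNs, depending on whether labels or positions should decide the pairing.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

aligned = a + b                    # NaN where a label exists in only one series
filled = a.add(b, fill_value=0)    # treat a missing label as 0 instead
by_position = a.values + b.values  # ignore labels, add element by element

print(aligned)
print(filled)
print(by_position)                 # [11 22 33]
```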

<h3>Timeseries objects</h3>
<li>Timeseries data in pandas is represented by a series
<li>Indexed by time
<li>A series can be read directly from a csv file
<li>The str dates are then converted into Timestamp objects

In [17]:
gs_price_data = pd.read_csv("GS.csv",index_col="Date")
#GS.csv was not in the working directory when this ran, so this cell and the ones that use gs_price_data fail

FileNotFoundError: File b'GS.csv' does not exist

In [18]:
gs_price_data.index[0]

NameError: name 'gs_price_data' is not defined

In [19]:
gs_price_data.index = pd.to_datetime(gs_price_data.index)

NameError: name 'gs_price_data' is not defined

In [20]:
gs_price_data.index[0]

NameError: name 'gs_price_data' is not defined
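
Since GS.csv is not available, the same steps can be tried on an in-memory CSV. This is a sketch under assumptions: the column layout (Date, Close) and the prices are made up, and io.StringIO stands in for the file.

```python
import io
import pandas as pd

# Made-up prices in a made-up layout; the real GS.csv is not available here
csv_text = """Date,Close
2018-08-22,236.1
2018-08-23,235.4
2018-08-24,237.0
"""

prices = pd.read_csv(io.StringIO(csv_text), index_col="Date")
print(type(prices.index[0]))     # the dates are read as plain text

prices.index = pd.to_datetime(prices.index)
print(type(prices.index[0]))     # now a pandas Timestamp

print(prices.loc[pd.to_datetime('2018-08-23'), 'Close'])   # 235.4
```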

<h3>Accessing data using a Timestamp index</h3>
<li>Key values passed to the Series must be of type Timestamp
<li>We need to convert str time into Timestamps
<li>pd.to_datetime will do this for us (we could also use the datetime library)
<li>index.get_loc gets the row number corresponding to the timestamp
<li>and, finally .iloc[row_number] returns the data

In [21]:
dt = pd.to_datetime('2018-08-23')
row = gs_price_data.index.get_loc(dt)
gs_price_data.iloc[row]

NameError: name 'gs_price_data' is not defined

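<h4>A lookup can find the 'nearest' row, the next row ('backfill'), or the most recent row ('pad')</h4>

In [22]:
dt = pd.to_datetime('2018-09-01')
#get_loc(dt, method=...) was removed in pandas 2.0; get_indexer accepts the same method argument
row = gs_price_data.index.get_indexer([dt],method="pad")[0] #'nearest', 'pad', 'backfill'
gs_price_data.iloc[row]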

NameError: name 'gs_price_data' is not defined
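
The three lookup methods can be compared on a tiny index. In pandas 2.0 and later the method argument of get_loc is gone, so this sketch uses Index.get_indexer, which accepts the same three method names; the dates and prices are invented.

```python
import pandas as pd

# Hypothetical trading days; 2018-09-01 falls in a gap
idx = pd.to_datetime(['2018-08-30', '2018-08-31', '2018-09-04'])
closes = pd.Series([236.0, 237.8, 235.2], index=idx)

dt = pd.to_datetime('2018-09-01')
row_pad = idx.get_indexer([dt], method='pad')[0]            # most recent earlier row
row_backfill = idx.get_indexer([dt], method='backfill')[0]  # next later row
row_nearest = idx.get_indexer([dt], method='nearest')[0]    # closest row either way

print(closes.iloc[row_pad], closes.iloc[row_backfill], closes.iloc[row_nearest])
# 237.8 235.2 237.8
```

get_indexer returns -1 when no row qualifies (for example 'pad' on a date before the first row), so that case is worth checking before calling iloc.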

<h4>Statistics on a series</h4>

In [23]:
gs_price_data.mean()
#gs_price_data.std()
#gs_price_data.pct_change()

NameError: name 'gs_price_data' is not defined

<h1>pandas DataFrame</h1>
<li>2-Dimensional structure
<li>Columns can contain data of different types (like an Excel spreadsheet)
<li>Can contain an index (or indices)
<li>Columns (and indices) can be named


<h3>Constructing a dataframe</h3>

In [24]:
df = pd.DataFrame([[11,22,13],[21,22,23]])
df

    0   1   2
0  11  22  13
1  21  22  23


In [25]:
df = pd.DataFrame([[11,22,13],[21,22,23]])
df.columns=['c1','c2','c3']
df.index = ['a','b']
df

   c1  c2  c3
a  11  22  13
b  21  22  23


In [26]:
df = pd.DataFrame([[11,22,13],[21,22,23]],index=['a','b'],columns=['c1','c2','c3'])
df

   c1  c2  c3
a  11  22  13
b  21  22  23


In [27]:
from datetime import date
date(2018,9,23)

datetime.date(2018, 9, 23)

In [28]:
tickers = ['AAPL','GOOG','GS']
dates = ['20180924','20180925']
data = np.zeros((2,3))
df = pd.DataFrame(data,index=dates,columns=tickers)
df

          AAPL  GOOG   GS
20180924   0.0   0.0  0.0
20180925   0.0   0.0  0.0


In [29]:
tickers = ['AAPL','GOOG','GS']
from datetime import date
dates = [date(2018,10,2),date(2018,10,3)]
data = np.zeros((2,3))
df = pd.DataFrame(data,index=dates,columns=tickers)
df

            AAPL  GOOG   GS
2018-10-02   0.0   0.0  0.0
2018-10-03   0.0   0.0  0.0


In [30]:
data = {'AAPL':[0.0,0.0],'GOOG':[0.0,0.0],'GS':[0.0,0.0]}
df = pd.DataFrame(data)
df.index = [date(2018,10,2),date(2018,10,3)]
df

            AAPL  GOOG   GS
2018-10-02   0.0   0.0  0.0
2018-10-03   0.0   0.0  0.0


<h3>Using an existing column as an index</h3>
<li>By default, set_index returns a copy with the new index and leaves the original df unchanged
<li>Use the inplace parameter to modify the df itself

In [31]:
df = pd.DataFrame([['r1','00','01','02'],['r2','10','11','12'],['r3','20','21','22']],columns=['row_label','A','B','C'])
print(df)
df.set_index('row_label',inplace=True)
print(df)
df = pd.DataFrame([['r1','00','01','02'],['r2','10','11','12'],['r3','20','21','22']],columns=['row_label','A','B','C'])
df.set_index('row_label',inplace=False)
print(df)

  row_label   A   B   C
0        r1  00  01  02
1        r2  10  11  12
2        r3  20  21  22
            A   B   C
row_label            
r1         00  01  02
r2         10  11  12
r3         20  21  22
  row_label   A   B   C
0        r1  00  01  02
1        r2  10  11  12
2        r3  20  21  22


<h4>Pandas dataframes work like dictionaries</h4>
<li>Column names can be used to access a column as a series from a df

In [32]:
data = {'AAPL':[217.2,218.7],'GOOG':[1166.2,1161.5],'GS':[235.3,231.1]}
df = pd.DataFrame(data)
df.index = [date(2018,9,21),date(2018,9,24)]


In [33]:
df['AAPL']

2018-09-21    217.2
2018-09-24    218.7
Name: AAPL, dtype: float64

In [34]:
type(df['AAPL'])

pandas.core.series.Series

<h4>Single columns (with no spaces in their names) can also be accessed using the attribute syntax</h4>

In [35]:
df.AAPL

2018-09-21    217.2
2018-09-24    218.7
Name: AAPL, dtype: float64

<h4>Chained indexing leads to a specific cell in the table</h4>

In [36]:
df['AAPL'][date(year=2018,month=9,day=24)]

218.7

<h3>Selecting rows</h3>
<li>rows can be selected using the index df.loc[index_value]
<li>or using row number df.iloc[row_number]
<li>Note that both use square brackets (dictionary-like indexing), not function-call parentheses!

In [37]:
df.loc[date(2018,9,21)]

AAPL     217.2
GOOG    1166.2
GS       235.3
Name: 2018-09-21, dtype: float64

In [38]:
df.iloc[0]

AAPL     217.2
GOOG    1166.2
GS       235.3
Name: 2018-09-21, dtype: float64

<h4>Accessing a specific value</h4>

In [None]:
df['AAPL'].loc[date(2018,9,21)]
#df.loc[date(2018,9,21)]['AAPL']

<h4>Add a new column</h4>

In [None]:
df['IONS'] = np.nan #np.NaN was removed in NumPy 2.0; np.nan works in all versions
df

<h4>Selecting multiple columns</h4>
<li>Use a <b>list</b> containing the names of the desired columns

In [None]:
df[['AAPL','GOOG']]

In [None]:
df = pd.DataFrame([[11,22,13],[21,22,23]])
df.columns=['c1','c2','c3']
df.index = ['a','b']
df

<h4>Creating a new column using a pattern</h4>

In [None]:
df['Mult3'] =np.where(df['c1']%3==0,1,0)
df

<h3>Slicing</h3>

In [None]:
df = pd.DataFrame([[11,12,13,14,15],
                   [21,22,23,24,25],
                   [31,32,33,34,35],
                   [41,42,43,44,45],
                   [51,52,53,54,55]])
df.index =['r1','r2','r3','r4','r5']
df.columns = ['c1','c2','c3','c4','c5']
df

In [None]:
df.loc['r2':'r4']

In [None]:
df.loc[:,'c2':'c4']

In [None]:
df.loc['r2':'r4','c2':'c4']

In [None]:
df.iloc[1:4,1:4]
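
One detail behind the slices above: label slices with .loc include both endpoints, while position slices with .iloc exclude the stop, as ordinary Python slices do. A minimal sketch on a smaller, made-up 3x3 frame:

```python
import pandas as pd

df = pd.DataFrame([[11, 12, 13], [21, 22, 23], [31, 32, 33]],
                  index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

by_label = df.loc['r1':'r2', 'c1':'c2']   # label slices include both endpoints
by_position = df.iloc[0:2, 0:2]           # position slices exclude the stop

print(by_label.equals(by_position))       # True
print(df.iloc[0:1].shape)                 # (1, 3)
```

This is why df.loc['r2':'r4','c2':'c4'] and df.iloc[1:4,1:4] select the same block in the 5x5 example.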

<h3>Working with views and copies</h3>

In [None]:
df = pd.DataFrame([[11,12,13,14,15],
                   [21,22,23,24,25],
                   [31,32,33,34,35],
                   [41,42,43,44,45],
                   [51,52,53,54,55]])
df.index =['r1','r2','r3','r4','r5']
df.columns = ['c1','c2','c3','c4','c5']
df_new = df

<h4>df_new points to the same dataframe as df</h4>

In [None]:
print(id(df),id(df_new))

<h4>Changes in df will result in a change in df_new</h4>

In [None]:
df.loc['r3','c3'] = 99
df_new

<h4>To work with a copy, use .copy()</h4>

In [None]:
df = pd.DataFrame([[11,12,13,14,15],
                   [21,22,23,24,25],
                   [31,32,33,34,35],
                   [41,42,43,44,45],
                   [51,52,53,54,55]])
df.index =['r1','r2','r3','r4','r5']
df.columns = ['c1','c2','c3','c4','c5']
df_new = df.copy()
df.loc['r3','c3'] = 99
df_new

<h1>Grouping functionality in Pandas</h1>
<li>Pandas allows grouping by value as well as grouping by functions

In [None]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'D' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three']})
df

<h3>Group by column values</h3>

In [None]:
df.groupby('B')

In [None]:
df.groupby('B').size()

<h3>Group by multiple columns</h3>

In [None]:
df.groupby(['A','C']).size()

<h3>Grouping by function</h3>

In [None]:
import pandas as pd
import numpy as np
people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Moe', 'Jill', 'Qing', 'Ariana'])
people

<li>We want to choose a column and group elements in the column into two categories
<li>A cell value less than 0, will belong to the group "Negative"
<li>A cell with value greater or equal to 0, will belong to the group "Positive"

<li>We need to write a function that takes a dataframe, a column, and a row index as arguments
<li>The three arguments together will point to a single value
<li>And we can test the value to see which group it belongs to
<li>And return the group label

In [None]:
def GroupColFunc(df, ind, col):
    if df[col].loc[ind] < 0:
        return 'Negative'
    else:
        return 'Positive'



<li>Finally, we'll pass the function to groupby
<li>Just like groupby used the values foo, bar etc. to group the data,
<li>It will use the values returned by the function to group the data


In [None]:
grouped = people.groupby(lambda x: GroupColFunc(people, x, 'a'))
grouped
#print(grouped.size())
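
As an alternative to a per-row function, groupby also accepts an array with one label per row. The sketch below builds that array with np.where; the fixed numbers replace the random ones so the result is reproducible, and the names labels and grouped_alt are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed numbers instead of random ones so the result is reproducible
people = pd.DataFrame({'a': [-1.0, 2.0, -3.0, 4.0, 0.0],
                       'b': [10.0, 20.0, 30.0, 40.0, 50.0]},
                      index=['Joe', 'Moe', 'Jill', 'Qing', 'Ariana'])

# One label per row, computed in a single vectorized step
labels = np.where(people['a'] < 0, 'Negative', 'Positive')
grouped_alt = people.groupby(labels)

sizes = grouped_alt.size()
means_b = grouped_alt['b'].mean()
print(sizes)
print(means_b)
```

Both approaches give the same groups; the array form avoids calling a Python function once per row.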

<h3>Group statistics</h3>

In [None]:
grouped.mean()
#grouped.std()
#grouped.count()
#grouped.cumcount()

#%matplotlib inline
#grouped.mean().plot(kind='bar')

<h3>Digression: Lambda functions</h3>
<li>Anonymous or "throw-away" functions
<li>Useful for dataframe operations
<li>Useful for defining functions for simple operations

In [None]:
foo = lambda x,y: x+y
foo(2,3)

<h4>Example: We can change the sort key using a function</h4>

In [None]:
x=[(1,2),(4,5),(3,3),(9,1)]
def sort_key(x):
    return x[1]
sorted(x,key=sort_key)

<h3>Or we can "inline" the function</h3>
<li>Makes it more readable

In [None]:
x=[(1,2),(4,5),(3,3),(4,1)]
sorted(x,key=lambda x: x[1])

In [None]:
x=[(1,2),(4,5),(3,3),(4,1)]
sorted(x,key=lambda x: x[0] + x[1])

<h2>Join, merge and concatenate dataframes</h2>
<li>Pandas aligns the dataframes on their row and column labels and fills any gaps with NaN


In [None]:
df1 = pd.DataFrame([[1,2,3],[4,5,6]],index=['a','b'],columns=['A','B','C'])
df1

In [None]:
df2 = pd.DataFrame([[7,8,9],[10,11,12]],index=['c','d'],columns=['A','B','C'])
df2

In [None]:
pd.concat([df1,df2])

<h3>Concat can handle column mismatches</h3>

In [None]:
df2 = pd.DataFrame([[7,8,9],[10,11,12]],index=['c','d'],columns=['K','B','C'])
df2

In [None]:
pd.concat([df1,df2])

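<h4>Concat works with multiple data frames and returns a new dataframe. The older df1.append(df2) also returned a new dataframe rather than modifying df1; it was removed in pandas 2.0, so concat is used here</h4>

In [None]:
pd.concat([df1,df2]) #replaces df1.append(df2)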

<h3>Join</h3>
<li>Pandas provides a full featured join (like SQL)
<li>https://pandas.pydata.org/pandas-docs/stable/merging.html

In [None]:
df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],index=['a','b','c','d'],columns=['A','B','C'])
df1

In [None]:
df2 = pd.DataFrame([[17,18,19],[101,111,121]],index=['c','d'],columns=['K','L','D'])
df2

In [None]:
df1.join(df2)
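
By default join keeps every row of the left dataframe (a left join on the index). The how parameter selects the other SQL-style variants. A sketch on smaller, made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [4, 5], [7, 8]], index=['a', 'b', 'c'], columns=['A', 'B'])
df2 = pd.DataFrame([[17, 18], [101, 111]], index=['c', 'd'], columns=['K', 'L'])

left = df1.join(df2)                 # default how='left': keep every row of df1
inner = df1.join(df2, how='inner')   # only labels present in both
outer = df1.join(df2, how='outer')   # labels present in either

# merge expresses the same inner join explicitly on the two indexes
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')

print(left.shape, inner.shape, outer.shape)   # (3, 4) (1, 4) (4, 4)
print(merged)
```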

<h1>Working with Pandas</h1>

In [41]:
pd.__version__

'0.23.0'

In [40]:
#installing pandas libraries
!source activate py36;pip install pandas --upgrade
#!source activate py36;pip install pandas-datareader --upgrade
#!pip install --upgrade html5lib==1.0b8

#There is a bug in the latest version of html5lib so install an earlier version
#Restart kernel after installing html5lib

Could not find conda environment: py36
You can list all discoverable environments with `conda info --envs`.

Collecting pandas
  Downloading https://files.pythonhosted.org/packages/78/78/50ef81a903eccc4e90e278a143c9a0530f05199f6221d2e1b21025852982/pandas-0.23.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (14.6MB)
    100% |████████████████████████████████| 14.7MB 2.0MB/s
Installing collected packages: pandas
  Found existing installation: pandas 0.23.0
    Uninstalling pandas-0.23.0:
      Successfully uninstalled pandas-0.23.0
Successfully installed pandas-0.23.4
You are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


<h2>Imports</h2>

In [43]:
import pandas as pd #pandas library
from pandas_datareader import data #data readers (google, html, etc.)
#The following line ensures that graphs are rendered in the notebook
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt #Plotting library
import datetime as dt #datetime for timeseries support

<h2>Pandas datareader</h2>
<li>Access data from html tables on any web page</li>
<li>Get data from google finance</li>
<li>Get data from the federal reserve</li>
<li>Read csv files</li>

<h3>HTML Tables</h3>
<li>Pandas itself (pd.read_html) can read a table in an html page into a dataframe
<li>The read_html function returns a list of dataframes, one for each html table on the page

<h4>Example: Read tables from an html page</h4>

In [44]:
import requests
df_list = pd.read_html('https://www.x-rates.com/table/?from=USD&amount=1')
#df_list = pd.read_html("https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml",flavor="bs4")
print(len(df_list))

2


<h4>The page contains two tables</h4>

In [None]:
df1 = df_list[0]
df2 = df_list[1]
print(df1)
print(df2)

<h4>Note that the read_html function has automatically detected the header columns</h4>
<h4>If an index is necessary, we need to explicitly specify it</h4>

In [None]:
df1

In [None]:
df1.set_index('US Dollar',inplace=True)
print(df1)

<h4>Now we can use .loc to extract specific currency rates</h4>

In [None]:
df1.loc['Euro','1.00 USD']

<h2>Getting historical stock prices from yahoo finance</h2>
Usage: DataReader(ticker,source,startdate,enddate)<br>



In [None]:
from pandas_datareader import data as web
import datetime
start=datetime.datetime(2000, 1, 1)
end=datetime.datetime.today()
print(start,end)

#df = web.DataReader('IBM', 'yahoo', start, end)

In [None]:
df

<h2>Datareader documentation</h2>
http://pandas-datareader.readthedocs.io/en/latest/

<h3>Working with a timeseries data frame</h3>
<li>The data is organized by time with the index serving as the timeline


<h4>Creating new columns</h4>
<li>Add a column to a dataframe
<li>Base the elements of the column on some combination of data in the existing columns
<h4>Example: Number of days that the stock closed higher than it opened</h4>
<li>We'll create a new column with the header "UP"
<li>And use np.where to decide what to put in the column

In [None]:
df['UP']=np.where(df['Close']>df['Open'],1,0)
df

<h3>Get summary statistics</h3>
<li>The "describe" function returns a dataframe containing summary stats for all numerical columns
<li>Columns containing non-numerical data are ignored

In [None]:
df.describe()

<h4>Calculate the percentage of days that the stock has closed higher than its open</h4>

In [None]:
df['UP'].sum()/df['UP'].count()
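
The UP column and the fraction of up days can be checked by hand on a few rows. The prices below are invented; the column names Open and Close match the ones used above.

```python
import numpy as np
import pandas as pd

# Invented open/close prices for four days
df = pd.DataFrame({'Open':  [10.0, 11.0, 12.0, 11.5],
                   'Close': [10.5, 10.8, 12.4, 11.5]})

df['UP'] = np.where(df['Close'] > df['Open'], 1, 0)   # 1 on up days, else 0
share_up = df['UP'].sum() / df['UP'].count()

print(df['UP'].tolist())            # [1, 0, 1, 0]
print(share_up, df['UP'].mean())    # 0.5 0.5
```

Because UP holds only 0s and 1s, its mean is the same number as sum divided by count. Note that a day closing exactly at its open counts as 0.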

<h4>Calculate percent changes</h4>
<li>The function pct_change computes the change between successive rows (successive times in timeseries data) as a fraction, so 0.05 means 5%
<li>Defaults to a change over one period
<li>An integer argument changes the number of periods

In [None]:
df['Close'].pct_change() #One timeperiod percent change

In [None]:
n=13
df['Close'].pct_change(n) #n timeperiods percent change

<h3>NaN support</h3>
Pandas reductions such as mean, sum and std skip NaNs by default (skipna=True)

In [None]:
n=13
df['Close'].pct_change(n).mean()
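
A small worked case shows where the NaNs come from and how mean treats them. The price series is made up so that each step is a 10% rise.

```python
import pandas as pd

close = pd.Series([100.0, 110.0, 121.0, 133.1])

one_period = close.pct_change()    # (p[t] - p[t-1]) / p[t-1]
two_period = close.pct_change(2)   # (p[t] - p[t-2]) / p[t-2]

print(one_period.round(4).tolist())   # first entry is NaN: there is no earlier row
print(two_period.round(4).tolist())   # first two entries are NaN
print(round(two_period.mean(), 4))    # mean over the non-NaN entries only
```

The first n rows of an n-period change have no earlier row to compare against, which is why they are NaN and why the NaN-skipping mean matters.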

<h3>Rolling windows</h3>
<li>"rolling" function extracts rolling windows
<li>For example, the 21 period rolling window of the 13 period percent change 

In [None]:
df['Close'].pct_change(n).rolling(21)

<h4>Calculate something on the rolling windows</h4>

<h4>Example: mean (the 21 day moving average of the 13 day percent change)</h4>

In [None]:
n=13
df['Close'].pct_change(n).rolling(21).mean()
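
The rolling mean is easiest to verify on a short series. This sketch uses a window of 3 on made-up numbers; min_periods is an optional parameter of rolling that controls how many values a window needs before it produces a result.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

ma3 = s.rolling(window=3).mean()                          # NaN until 3 values are available
ma3_partial = s.rolling(window=3, min_periods=1).mean()   # average whatever is available

print(ma3.tolist())           # [nan, nan, 2.0, 3.0, 4.0]
print(ma3_partial.tolist())   # [1.0, 1.5, 2.0, 3.0, 4.0]
```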

<h4>Calculate several moving averages and graph them</h4>

In [None]:
ma_8 = df['Close'].pct_change(n).rolling(window=8).mean()
ma_13= df['Close'].pct_change(n).rolling(window=13).mean()
ma_21= df['Close'].pct_change(n).rolling(window=21).mean()
ma_34= df['Close'].pct_change(n).rolling(window=34).mean()
ma_55= df['Close'].pct_change(n).rolling(window=55).mean()

<h2>Plotting pandas series</h2>
<li>Pandas is tightly integrated with matplotlib, a graphing library
<li>All you need do is call 'plot' on any series
<li>When working on a jupyter notebook, add %matplotlib inline

In [None]:
%matplotlib inline

In [None]:
ma_8.plot()
ma_34.plot()

<h2>Linear regression with pandas</h2>
<h4>Example: TAN is the ticker for a solar ETF. FSLR, RGSE, and SPWR are tickers of companies that build or lease solar panels. Each has a different business model. We'll use pandas to study the risk-reward tradeoff between the 4 investments and also see how correlated they are</h4>

In [None]:
!source activate py36;pip install fix-yahoo-finance

In [None]:
import datetime
import pandas_datareader.data as web
import fix_yahoo_finance as yf
start = datetime.datetime(2015,7,1)
end = datetime.datetime(2016,7,1)
solar_df = web.DataReader(['FSLR', 'TAN','RGSE','SPWR'],'yahoo', start,end) #all price fields; 'Close' is selected in a later cell
#solar_df = web.get_data_yahoo(['FSLR', 'TAN','RGSE','SPWR'], start,end)

In [None]:
solar_df

In [None]:
solar_df = solar_df['Close']
solar_df

<h4>Let's calculate returns (the 1 day percent change)</h4>

In [None]:
rets = solar_df.pct_change()
print(rets)

<h4>Let's visualize the relationship between each stock and the ETF</h4>

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(rets.FSLR,rets.TAN)

In [None]:
plt.scatter(rets.RGSE,rets.TAN)

In [None]:
plt.scatter(rets.SPWR,rets.TAN)

<h4>The correlation matrix</h4>

In [None]:
solar_corr = rets.corr()
print(solar_corr)

<h3>Basic risk analysis</h3>
<h4>We'll plot the mean and std of returns for each ticker to get a sense of the risk return profile</h4>

In [None]:
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard deviations')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.show()


<h2>Regressions</h2>
http://statsmodels.sourceforge.net/

<h3>Steps for regression</h3>
<li>Construct y (dependent variable series)
<li>Construct matrix (dataframe) of X (independent variable series)
<li>Add intercept
<li>Model the regression
<li>Get the results
<h3>The statsmodels library contains various regression packages. We'll use the OLS (Ordinary Least Squares) model</h3>

In [None]:
import numpy as np
import statsmodels.api as sm
X=solar_df[['FSLR','RGSE','SPWR']]
X = sm.add_constant(X)
y=solar_df['TAN']
model = sm.OLS(y,X,missing='drop')
result = model.fit()
print(result.summary())
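
The regression steps can also be traced without market data. The sketch below is not the statsmodels code path: it swaps in np.linalg.lstsq on synthetic data generated from known coefficients, so the recovered values can be checked. The column of ones plays the role of sm.add_constant.

```python
import numpy as np

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (no noise), purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

X_const = np.column_stack([np.ones(len(X)), X])      # add the intercept column
coef, *_ = np.linalg.lstsq(X_const, y, rcond=None)   # least-squares fit
fitted = X_const @ coef                              # fitted values

print(np.round(coef, 6))   # intercept first, then the two slopes
```

With noise-free data the fit recovers the generating coefficients; with real prices the summary from statsmodels adds the standard errors and test statistics that this sketch leaves out.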

<h4>Finally plot the fitted line with the actual y values</h4>

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(y)
ax.plot(result.fittedvalues)