## More on Pandas

In [81]:
import pandas as pd
import numpy as np
import os

os.chdir('/Users/antoniojurlina/Google Drive/Spring 2021/SIE 598/data/')

'/Users/antoniojurlina/Google Drive/Spring 2021/SIE 598/data'

In [2]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [3]:
population['Maine'] = 1344000
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Maine          1344000
dtype: int64

In [4]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [5]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

Unnamed: 0,population,area
California,38332521,423967.0
Florida,19552860,170312.0
Illinois,12882135,149995.0
Maine,1344000,
New York,19651127,141297.0
Texas,26448193,695662.0


### Pandas index object

A Pandas Index behaves like an immutable array and like an ordered set. Strictly speaking it is a multiset, because it can contain duplicate values.


In [6]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

We can use standard Python indexing to retrieve values:

In [7]:
ind[1]

3

But indices are immutable; that is, they cannot be modified via the normal means:

In [8]:
ind[1]=4

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.
In this respect an index is loosely similar to a primary key in a relational database.
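
Since an index cannot be edited in place, a change has to produce a new Index object. The following is a minimal sketch of that idea; the variable names `assignment_failed` and `new_ind` are ours and do not appear in the notebook.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Item assignment is rejected because Index objects are immutable.
try:
    ind[1] = 4
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To "change" an entry, build a new Index; the original is left untouched.
new_ind = ind.delete(1).insert(1, 4)

print(assignment_failed)
print(list(ind))
print(list(new_ind))
```

Any DataFrame that still holds the original `ind` is unaffected by the creation of `new_ind`, which is the property that makes sharing safe.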

### Index attributes

An Index has the attributes size, shape, ndim, and dtype, similar to NumPy arrays.


In [9]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [10]:
print(states.index.size, states.index.shape, states.index.ndim, states.index.dtype)

6 (6,) 1 object


Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. 

Indexes act similarly to Python's built-in set type: they support intersection, union, and difference operations.

In [11]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [12]:
indA & indB  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [13]:
indA | indB  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [14]:
indA.difference(indB)  # elements in indA that are not in indB

Int64Index([1, 9], dtype='int64')

In [15]:
indB.difference(indA)  # elements in indB that are not in indA

Int64Index([2, 11], dtype='int64')

In [16]:
indA ^ indB  # symmetric difference - elements in either a or b but not both

Int64Index([1, 2, 9, 11], dtype='int64')
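
The outputs above come from a pandas version in which `&`, `|`, and `^` on Index objects act as set operations. In later pandas releases these operator forms were deprecated, and from pandas 2.0 they perform element-wise logical operations instead. A minimal sketch of the same four operations using the method forms, which should behave the same across versions (the result names are ours):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # in both
either = indA.union(indB)                   # in at least one
only_a = indA.difference(indB)              # in indA but not in indB
one_side = indA.symmetric_difference(indB)  # in exactly one of the two

print(list(common), list(either), list(only_a), list(one_side))
```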

### Data Indexing and Selection
 
Pandas has methods similar to NumPy for accessing and modifying values in Pandas Series and DataFrames.

#### Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. 

Keeping these analogies in mind helps in understanding the patterns of data indexing and selection in these arrays.

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

In [17]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

We can use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [18]:
'a' in data  # checks whether 'a' is a label in the index (a key), not a value

True

In [19]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [20]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [21]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Slicing, however, can be a source of some confusion.

When slicing with an explicit index, the final index is included in the slice.

When slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.

In [22]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [23]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

If a Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices.

A slicing operation like data[1:3] will use the implicit Python-style index.

In [24]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [25]:
# explicit index when indexing
data[1]

'a'

In [26]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

### Indexers: loc, iloc

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. 


- .loc is primarily label based, but may also be used with a boolean array. 

- .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. 



With loc, asking for rows 1, 2, or 100 does not return rows by position. It returns only the rows whose index label is 1, 2, or 100, and raises a KeyError if a requested label is missing.

With iloc, asking for rows 1, 2, or 100 returns the rows at those zero-based positions (the second, third, and 101st rows), regardless of the labels in the index.

In [27]:
data.loc[1]

'a'

In [28]:
data.iloc[1]

'b'
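
The difference also shows up when slicing. A minimal sketch on the same integer-indexed Series, with result names that are ours: loc slices by label and includes the end label, while iloc slices by position and excludes the end position.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = data.loc[1]       # label 1
position_item = data.iloc[1]   # position 1 (second element)

label_slice = data.loc[1:3]      # labels 1 through 3, end label included
position_slice = data.iloc[1:3]  # positions 1 and 2, end position excluded

print(label_item, position_item)
print(list(label_slice.index), list(position_slice.index))
```

Writing loc or iloc explicitly removes the need to remember which rule plain `data[...]` applies to an integer index.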

In [82]:
data=pd.read_csv('soccer.csv')

In [83]:
data.head()

Unnamed: 0,Team,Tournament,Goals,Shots pg,Discipline,Possession_Percent,Pass_percent,AerialsWon,Rating
0,Manchester City,Premier League,46,15.4,250,61.2,88.9,13.9,7.03
1,Aston Villa,Premier League,36,14.3,362,49.5,78.3,20.2,6.96
2,Bayern Munich,Bundesliga,61,16.3,281,59.0,85.5,13.1,6.93
3,Paris Saint-Germain,Ligue 1,57,15.4,465,60.1,89.4,9.6,6.91
4,Barcelona,LaLiga,49,16.0,401,61.6,89.3,10.5,6.88


In [33]:
data.shape

(20, 9)

In [34]:
data.index

RangeIndex(start=0, stop=20, step=1)

In [35]:
data.columns

Index(['Team', 'Tournament', 'Goals', 'Shots pg', 'Discipline',
       'Possession_Percent', 'Pass_percent', 'AerialsWon', 'Rating'],
      dtype='object')

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name.  

Selecting a single column by its label returns a Series.

In [36]:
data['Goals']

0     46
1     36
2     61
3     57
4     49
5     45
6     41
7     41
8     50
9     45
10    51
11    40
12    42
13    32
14    41
15    49
16    42
17    54
18    45
19    37
Name: Goals, dtype: int64

In [37]:
c1=data['Goals']

In [38]:
type(c1)

pandas.core.series.Series

In [39]:
data['Goals'].values

array([46, 36, 61, 57, 49, 45, 41, 41, 50, 45, 51, 40, 42, 32, 41, 49, 42,
       54, 45, 37])

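We can also use attribute-style access when the column name is a string that is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The result is again a Series.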

In [40]:
data.Goals

0     46
1     36
2     61
3     57
4     49
5     45
6     41
7     41
8     50
9     45
10    51
11    40
12    42
13    32
14    41
15    49
16    42
17    54
18    45
19    37
Name: Goals, dtype: int64
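
Attribute-style access has limits. The sketch below uses two rows copied from the soccer table plus a hypothetical `size` column (not part of the dataset) to show a name that collides with a DataFrame attribute.

```python
import pandas as pd

# 'size' is a made-up column, added only to show a name clash.
df = pd.DataFrame({'Goals': [46, 36],
                   'Shots pg': [15.4, 14.3],
                   'size': [11, 11]})

same = df.Goals.equals(df['Goals'])  # attribute and bracket access agree here
shots = df['Shots pg']               # the space rules out attribute access
n_cells = df.size                    # DataFrame attribute: rows * columns
size_col = df['size']                # bracket access still reaches the column

print(same, n_cells, list(size_col))
```

Bracket access works for every column name, so it is the safer default in scripts.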

In [41]:
c2 = data.Goals
type(c2)

pandas.core.series.Series

To select two or more columns, pass a list of column names inside the indexing brackets, which gives the double square brackets. The result in this case is a DataFrame.

In [42]:
data[['Goals','Shots pg']]

Unnamed: 0,Goals,Shots pg
0,46,15.4
1,36,14.3
2,61,16.3
3,57,15.4
4,49,16.0
5,45,11.2
6,41,13.5
7,41,15.4
8,50,14.3
9,45,15.1


In [43]:
c3=data[['Goals','Shots pg']]
type(c3)

pandas.core.frame.DataFrame
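
The bracket rule also applies to a single column: a bare label gives a Series, while a one-element list gives a one-column DataFrame. A minimal sketch using three rows copied from the soccer table (the frame `df` and the result names are ours):

```python
import pandas as pd

df = pd.DataFrame({'Goals': [46, 36, 61],
                   'Shots pg': [15.4, 14.3, 16.3]})

as_series = df['Goals']    # one-dimensional Series
as_frame = df[['Goals']]   # two-dimensional DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```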

The loc indexer has the syntax data.loc[<row selection>, <column selection>].
Allowed inputs include:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A slice object with labels, e.g. ``'a':'f'``.

In [44]:
data.loc[5,'Goals']

45

In [45]:
data.loc[5]

Team                  Atletico Madrid
Tournament                     LaLiga
Goals                              45
Shots pg                         11.2
Discipline                        580
Possession_Percent               52.3
Pass_percent                     82.9
AerialsWon                         15
Rating                           6.87
Name: 5, dtype: object

In [46]:
data.loc[2:5,:'Goals']

Unnamed: 0,Team,Tournament,Goals
2,Bayern Munich,Bundesliga,61
3,Paris Saint-Germain,Ligue 1,57
4,Barcelona,LaLiga,49
5,Atletico Madrid,LaLiga,45


In [47]:
data.loc[7:10,:'Discipline']

Unnamed: 0,Team,Tournament,Goals,Shots pg,Discipline
7,Juventus,Serie A,41,15.4,455
8,Manchester United,Premier League,50,14.3,381
9,AC Milan,Serie A,45,15.1,522
10,Lyon,Ligue 1,51,16.8,435


In [48]:
data.loc[:5,:'Shots pg']

Unnamed: 0,Team,Tournament,Goals,Shots pg
0,Manchester City,Premier League,46,15.4
1,Aston Villa,Premier League,36,14.3
2,Bayern Munich,Bundesliga,61,16.3
3,Paris Saint-Germain,Ligue 1,57,15.4
4,Barcelona,LaLiga,49,16.0
5,Atletico Madrid,LaLiga,45,11.2


In [49]:
data.loc[:5,'Rating']

0    7.03
1    6.96
2    6.93
3    6.91
4    6.88
5    6.87
Name: Rating, dtype: float64

In [50]:
data.loc[:7,:'Discipline']

Unnamed: 0,Team,Tournament,Goals,Shots pg,Discipline
0,Manchester City,Premier League,46,15.4,250
1,Aston Villa,Premier League,36,14.3,362
2,Bayern Munich,Bundesliga,61,16.3,281
3,Paris Saint-Germain,Ligue 1,57,15.4,465
4,Barcelona,LaLiga,49,16.0,401
5,Atletico Madrid,LaLiga,45,11.2,580
6,Real Madrid,LaLiga,41,13.5,381
7,Juventus,Serie A,41,15.4,455


iloc in pandas is used to select rows and columns by number, in the order that they appear in the DataFrame.

Effectively each row has a row number from 0 to data.shape[0] - 1, and iloc[] allows selections based on these numbers.

The same applies to columns (ranging from 0 to data.shape[1] - 1).

The DataFrame row and column labels are however returned in the result.
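
A minimal sketch contrasting the two indexers on a small frame built from four rows of the soccer table (the frame `df` and the result names are ours): the same-looking bounds select different amounts of data because loc includes the end label and iloc excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Manchester City', 'Aston Villa',
                            'Bayern Munich', 'Paris Saint-Germain'],
                   'Goals': [46, 36, 61, 57],
                   'Rating': [7.03, 6.96, 6.93, 6.91]})

by_label = df.loc[:2, :'Goals']   # row labels 0..2 and columns Team..Goals, ends included
by_position = df.iloc[:2, :2]     # row positions 0..1 and column positions 0..1

print(by_label.shape, by_position.shape)
print(list(by_position.columns))  # labels are kept in the result
```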

In [51]:
data.iloc[:7,:4]

Unnamed: 0,Team,Tournament,Goals,Shots pg
0,Manchester City,Premier League,46,15.4
1,Aston Villa,Premier League,36,14.3
2,Bayern Munich,Bundesliga,61,16.3
3,Paris Saint-Germain,Ligue 1,57,15.4
4,Barcelona,LaLiga,49,16.0
5,Atletico Madrid,LaLiga,45,11.2
6,Real Madrid,LaLiga,41,13.5


In [52]:
data.loc[[1,2,8],'Goals']

1    36
2    61
8    50
Name: Goals, dtype: int64

In [53]:
data.loc[1:3,'Goals']

1    36
2    61
3    57
Name: Goals, dtype: int64

In [54]:
data.iloc[:3, 2]

0    46
1    36
2    61
Name: Goals, dtype: int64

In [55]:
data.iloc[[1,2,8],2]

1    36
2    61
8    50
Name: Goals, dtype: int64

In [56]:
data.iloc[-1] # last row of the dataframe

Team                   RB Leipzig
Tournament             Bundesliga
Goals                          37
Shots pg                     15.6
Discipline                    370
Possession_Percent           56.1
Pass_percent                 82.4
AerialsWon                   19.9
Rating                       6.81
Name: 19, dtype: object

In [57]:
data.iloc[:,-1] #last column of the data frame

0     7.03
1     6.96
2     6.93
3     6.91
4     6.88
5     6.87
6     6.86
7     6.86
8     6.85
9     6.85
10    6.85
11    6.84
12    6.83
13    6.82
14    6.82
15    6.82
16    6.82
17    6.81
18    6.81
19    6.81
Name: Rating, dtype: float64

In [58]:
data.ix[:3,'Goals']  # .ix was removed in pandas 1.0; use .loc or .iloc instead

AttributeError: 'DataFrame' object has no attribute 'ix'

The Pandas loc indexer can be used with DataFrames to select rows with a boolean (conditional) lookup:

In [59]:
data.loc[data.Goals >= 50]

Unnamed: 0,Team,Tournament,Goals,Shots pg,Discipline,Possession_Percent,Pass_percent,AerialsWon,Rating
2,Bayern Munich,Bundesliga,61,16.3,281,59.0,85.5,13.1,6.93
3,Paris Saint-Germain,Ligue 1,57,15.4,465,60.1,89.4,9.6,6.91
8,Manchester United,Premier League,50,14.3,381,54.6,84.7,15.5,6.85
10,Lyon,Ligue 1,51,16.8,435,54.5,85.6,14.0,6.85
17,Inter,Serie A,54,15.0,411,52.8,86.5,12.4,6.81
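
Conditions can be combined before they are passed to loc. The sketch below uses five rows copied from the soccer table; the frame `df` and the result names are ours. Each comparison needs its own parentheses because `&` binds more tightly than `>=` and `==`.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Bayern Munich', 'Paris Saint-Germain',
                            'Manchester United', 'Lyon', 'Barcelona'],
                   'Tournament': ['Bundesliga', 'Ligue 1',
                                  'Premier League', 'Ligue 1', 'LaLiga'],
                   'Goals': [61, 57, 50, 51, 49]})

# Rows that satisfy both conditions, restricted to two columns.
mask = (df['Goals'] >= 50) & (df['Tournament'] == 'Ligue 1')
ligue1_high = df.loc[mask, ['Team', 'Goals']]

# isin() tests membership in a list of allowed values.
selected = df.loc[df['Tournament'].isin(['LaLiga', 'Bundesliga']), 'Team']

print(list(ligue1_high['Team']))
print(list(selected))
```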


In [60]:
data.loc[5:10,:'Team']

Unnamed: 0,Team
5,Atletico Madrid
6,Real Madrid
7,Juventus
8,Manchester United
9,AC Milan
10,Lyon


We can replace the default index with one of the columns using set_index:

In [61]:
data.set_index("Team", inplace=True)
data.head()

Team,Tournament,Goals,Shots pg,Discipline,Possession_Percent,Pass_percent,AerialsWon,Rating
Manchester City,Premier League,46,15.4,250,61.2,88.9,13.9,7.03
Aston Villa,Premier League,36,14.3,362,49.5,78.3,20.2,6.96
Bayern Munich,Bundesliga,61,16.3,281,59.0,85.5,13.1,6.93
Paris Saint-Germain,Ligue 1,57,15.4,465,60.1,89.4,9.6,6.91
Barcelona,LaLiga,49,16.0,401,61.6,89.3,10.5,6.88
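
The cell above modifies `data` in place. As an alternative sketch, set_index without `inplace=True` returns a new DataFrame and leaves the original unchanged, and reset_index moves the index back into a regular column. The two-row frame `df` and the result names are ours.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Manchester City', 'Barcelona'],
                   'Goals': [46, 49]})

by_team = df.set_index('Team')             # new DataFrame indexed by team name
goals = by_team.loc['Barcelona', 'Goals']  # label-based lookup on the new index
restored = by_team.reset_index()           # Team becomes a regular column again

print(list(df.columns), list(by_team.columns), goals)
print(list(restored.columns))
```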


In [62]:
data.loc['Barcelona'] # selects a row based on team name

Tournament            LaLiga
Goals                     49
Shots pg                  16
Discipline               401
Possession_Percent      61.6
Pass_percent            89.3
AerialsWon              10.5
Rating                  6.88
Name: Barcelona, dtype: object

In [63]:
c4=data.loc[['Barcelona']]
type(c4)

pandas.core.frame.DataFrame

In [64]:
data.loc[['Barcelona'],["Goals", "Shots pg"]]

Team,Goals,Shots pg
Barcelona,49,16.0


In [65]:
data.loc[['Barcelona','Aston Villa']]

Team,Tournament,Goals,Shots pg,Discipline,Possession_Percent,Pass_percent,AerialsWon,Rating
Barcelona,LaLiga,49,16.0,401,61.6,89.3,10.5,6.88
Aston Villa,Premier League,36,14.3,362,49.5,78.3,20.2,6.96


In [66]:
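# Misplaced bracket: the comma sits outside loc[...], so this builds a tuple
# (DataFrame, list) instead of selecting columns; the next cell has the intended call
data.loc[['Barcelona','Aston Villa']], [["Goals", "Shots pg"]]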

(                 Tournament  Goals  Shots pg  Discipline  Possession_Percent  \
 Team                                                                           
 Barcelona            LaLiga     49      16.0         401                61.6   
 Aston Villa  Premier League     36      14.3         362                49.5   
 
              Pass_percent  AerialsWon  Rating  
 Team                                           
 Barcelona            89.3        10.5    6.88  
 Aston Villa          78.3        20.2    6.96  ,
 [['Goals', 'Shots pg']])

In [67]:
data.loc[['Barcelona','Aston Villa'], ["Goals", "Shots pg"]]

Team,Goals,Shots pg
Barcelona,49,16.0
Aston Villa,36,14.3


Using loc, we can also combine a list of row labels with a slice over a range of column labels. The rows come back in the order listed, so the labels do not have to be in sorted order.

In [68]:
data.loc[['Barcelona','Aston Villa'], :"Discipline"]

Team,Tournament,Goals,Shots pg,Discipline
Barcelona,LaLiga,49,16.0,401
Aston Villa,Premier League,36,14.3,362


In [69]:
data.loc[['Barcelona','Aston Villa'], "Goals":"Rating"]

Team,Goals,Shots pg,Discipline,Possession_Percent,Pass_percent,AerialsWon,Rating
Barcelona,49,16.0,401,61.6,89.3,10.5,6.88
Aston Villa,36,14.3,362,49.5,78.3,20.2,6.96


In [70]:
data.loc[data['Tournament'] == 'Premier League']

Team,Tournament,Goals,Shots pg,Discipline,Possession_Percent,Pass_percent,AerialsWon,Rating
Manchester City,Premier League,46,15.4,250,61.2,88.9,13.9,7.03
Aston Villa,Premier League,36,14.3,362,49.5,78.3,20.2,6.96
Manchester United,Premier League,50,14.3,381,54.6,84.7,15.5,6.85
Chelsea,Premier League,40,14.1,331,58.9,86.9,16.0,6.84
Leicester,Premier League,42,12.1,440,51.8,81.3,15.6,6.83


In [71]:
data.loc[data['Tournament'] == 'Premier League',['Goals', 'Shots pg']]

Team,Goals,Shots pg
Manchester City,46,15.4
Aston Villa,36,14.3
Manchester United,50,14.3
Chelsea,40,14.1
Leicester,42,12.1


In [72]:
dates = pd.date_range('1/1/2021', periods=10)

df = pd.DataFrame(np.random.randn(10, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])

df

Unnamed: 0,A,B,C,D
2021-01-01,1.345264,1.771594,0.698686,-0.730216
2021-01-02,-0.366869,0.410806,-0.151393,-0.724099
2021-01-03,-0.089142,0.570394,1.803552,-1.086197
2021-01-04,0.136208,-0.83152,1.301119,0.053572
2021-01-05,-2.206216,0.899399,1.407375,1.728097
2021-01-06,-0.257375,-0.125807,-0.029365,-0.368601
2021-01-07,1.011944,0.626131,-1.998985,-0.48234
2021-01-08,0.334218,0.234861,0.285419,0.748511
2021-01-09,-0.328679,0.368041,2.418786,-0.024204
2021-01-10,-1.14091,0.932104,-0.70651,1.865546


Differences compared to NumPy: in NumPy, a single index accesses a row.

In [73]:
df.values[0]

array([ 1.34526442,  1.77159395,  0.69868615, -0.73021568])

In a DataFrame, a single index accesses a column.

In [74]:
s = df['A']
s

2021-01-01    1.345264
2021-01-02   -0.366869
2021-01-03   -0.089142
2021-01-04    0.136208
2021-01-05   -2.206216
2021-01-06   -0.257375
2021-01-07    1.011944
2021-01-08    0.334218
2021-01-09   -0.328679
2021-01-10   -1.140910
Freq: D, Name: A, dtype: float64
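
A minimal side-by-side sketch of this difference, using a small deterministic array in place of the random data above (the names `values`, `first_row_np`, `col_a`, and `first_row_df` are ours):

```python
import numpy as np
import pandas as pd

values = np.arange(12).reshape(3, 4)
df = pd.DataFrame(values, columns=['A', 'B', 'C', 'D'])

first_row_np = values[0]   # NumPy: a single index selects a row
col_a = df['A']            # DataFrame: a single index selects a column
first_row_df = df.iloc[0]  # DataFrame: use iloc (or loc) to select a row

print(first_row_np.tolist(), col_a.tolist(), first_row_df.tolist())
```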

In [75]:
s1=df[['A','B']]
s1

Unnamed: 0,A,B
2021-01-01,1.345264,1.771594
2021-01-02,-0.366869,0.410806
2021-01-03,-0.089142,0.570394
2021-01-04,0.136208,-0.83152
2021-01-05,-2.206216,0.899399
2021-01-06,-0.257375,-0.125807
2021-01-07,1.011944,0.626131
2021-01-08,0.334218,0.234861
2021-01-09,-0.328679,0.368041
2021-01-10,-1.14091,0.932104


In [76]:
df.T

Unnamed: 0,2021-01-01,2021-01-02,2021-01-03,2021-01-04,2021-01-05,2021-01-06,2021-01-07,2021-01-08,2021-01-09,2021-01-10
A,1.345264,-0.366869,-0.089142,0.136208,-2.206216,-0.257375,1.011944,0.334218,-0.328679,-1.14091
B,1.771594,0.410806,0.570394,-0.83152,0.899399,-0.125807,0.626131,0.234861,0.368041,0.932104
C,0.698686,-0.151393,1.803552,1.301119,1.407375,-0.029365,-1.998985,0.285419,2.418786,-0.70651
D,-0.730216,-0.724099,-1.086197,0.053572,1.728097,-0.368601,-0.48234,0.748511,-0.024204,1.865546


The loc attribute allows indexing and slicing that always references the explicit index: