*ADS-A Week 1 Assignment 2*

# Pandas Exercises

### Introduction
[Pandas](http://pandas.pydata.org/) is a software library written for the Python programming language for data manipulation and analysis. It is built on NumPy. In particular, it offers data structures and operations for manipulating numerical tables and (time) series. The name is derived from "panel data", an econometrics term for multidimensional, structured data sets. Central to pandas are the data objects Series (indexed arrays) and DataFrames (full-fledged tables). Many NumPy array indexing techniques can be used in the same way when indexing pandas Series and DataFrames.

Pandas is well suited to preparing data that will be analysed with scikit-learn models, which is our main reason for getting familiar with it. In most publicly available machine learning notebooks, pandas (or NumPy) is used for the initial data preparation. Pandas offers excellent tools for reading data from CSV files, Excel files and databases. It also supports SQL-like manipulation of DataFrames.
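
As a small illustration of the CSV and SQL-like features mentioned above, the sketch below reads a made-up CSV string and aggregates it roughly the way a SQL `GROUP BY` would. The column names `city` and `sales` and the data itself are assumptions for this example only; they are not part of the assignment.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path to read_csv
csv_text = "city,sales\nAMS,10\nEHV,5\nAMS,7\n"

df = pd.read_csv(io.StringIO(csv_text))

# Roughly: SELECT city, SUM(sales) FROM df GROUP BY city
totals = df.groupby("city")["sales"].sum()
print(totals)
```

The result `totals` is itself a Series, indexed by the distinct values of `city`.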

In [None]:
import pandas as pd
import numpy as np

# Set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 5)

### Creating Series

In [None]:
# Create a pandas Series for the number list [10..20) ...
pd.Series(range(10, 20))

# Alternative
pd.Series(np.arange(10, 20))

In [38]:
s1 = pd.Series(range(10, 20))
s2 = pd.Series(range(10, 20), index=list('abcdefghij'))

print(s1)
print(s2)

# Get the index of both Series 

## Your code ...
print(s1.index)
print(s2.index)

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32
a    10
b    11
c    12
d    13
e    14
f    15
g    16
h    17
i    18
j    19
dtype: int32
RangeIndex(start=0, stop=10, step=1)
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')


In [77]:
# Create a Series of 10 random numbers

## Your code ...
import pandas as pd
import numpy as np
import random
pd.Series(np.random.randn(10))

0   -1.726199
1   -0.455465
2    0.327986
3    0.467831
4   -1.462368
5    0.348446
6    0.191710
7   -0.282348
8   -1.218600
9    1.580617
dtype: float64

In [None]:
# Create a Series of 15 numbers equally spaced over the range [20..60]

## Your code ...
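
One possible approach to the exercise above, offered as a sketch rather than the intended solution, is to let NumPy generate the evenly spaced values and wrap them in a Series. It relies on `np.linspace` including both endpoints by default.

```python
import numpy as np
import pandas as pd

# 15 values from 20 to 60 inclusive, equally spaced
s = pd.Series(np.linspace(20, 60, 15))
print(s)
```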


In [86]:
a = pd.Series(range(6), index=list('abcdef'))

# Create the above Series from a dict

## Your code ...
li = list('abcdef')
lv = [0,1,2,3,4,5]
d = dict(zip(li, lv))
s = pd.Series(d)
print(s)
print(a)

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64
a    0
b    1
c    2
d    3
e    4
f    5
dtype: int32


### Getting and Setting Series Data 

In [88]:
s1 = pd.Series(range(10, 20))
# Get the value on position 3 (value of 4th element). 

## Your code ...
print(s1.iloc[3])

13


In [90]:
s2 = pd.Series(range(10, 20), index=list('abcdefghij'))
# Get value labeled 'c'

## Your code ...
print(s2['c'])
print(s2)

12
a    10
b    11
c    12
d    13
e    14
f    15
g    16
h    17
i    18
j    19
dtype: int32


In [95]:
# Show that you can also use a textual label as an attribute (s2.label)

## Your code ...
print(s2.c)

12


In [96]:
# Note that the default positional index is lost once you pass an explicit numeric index
s2 = pd.Series(range(10, 20), index=range(100, 110))
s2[0]      # throws key error

KeyError: 0

In [None]:
# To avoid this confusion, explicit label lookup (loc) and positional lookup (iloc) have been defined:
s2[101] == s2.loc[101] == s2.iloc[1]
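
The difference between the two lookups can be sketched a little further. The example below reuses the Series from the cell above and also shows how slicing differs between `loc` and `iloc`; the variable names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

s2 = pd.Series(range(10, 20), index=range(100, 110))

by_label = s2.loc[101]     # label-based: the entry whose index label is 101
by_position = s2.iloc[1]   # position-based: the second entry, whatever its label
print(by_label, by_position)

# Slices differ too: loc includes the end label, iloc excludes the end position
print(len(s2.loc[100:102]), len(s2.iloc[0:2]))
```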

In [101]:
s = pd.Series(range(10, 20))
# Set every second value in the series to NaN (Not a Number, NumPy's version of void or null)

## Your code ...
s = s.astype(float)   # NaN is a float, so convert explicitly before assigning it
s.iloc[::2] = np.nan  # a step-2 slice selects every second position
print(s)

0     NaN
1    11.0
2     NaN
3    13.0
4     NaN
5    15.0
6     NaN
7    17.0
8     NaN
9    19.0
dtype: float64


### Boolean Selection or Masking

In [40]:
s = pd.Series(range(10, 20))
# Get rows with even values
s[s%2 == 0]

0    10
2    12
4    14
6    16
8    18
dtype: int32

In [44]:
s = pd.Series(range(10, 20))
# Get rows whose values are divisible by 2 or by 3
# Use parentheses and &, | and ~ as boolean operators (same as in NumPy)

## Your code ...
s1 = s[(s%2 == 0) | (s%3 == 0)]
print(s1)

0    10
2    12
4    14
5    15
6    16
8    18
dtype: int32
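
As a complement to the cell above, here is a brief sketch of the other two mask operators: `~` for negation and `&` for "and". The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(range(10, 20))

divisible = (s % 2 == 0) | (s % 3 == 0)
# ~ inverts a boolean mask (Python's `not` does not work element-wise)
neither = s[~divisible]
# & requires both conditions; the parentheses are needed because of operator precedence
both = s[(s % 2 == 0) & (s % 3 == 0)]
print(neither.tolist())
print(both.tolist())
```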


In [46]:
# Re-index s1 so as to have a decent index again (starting with 0 and increment by 1 for each new entry)

## Your code ...
s1.index = range(0, len(s1))
print(s1)

0    10
1    12
2    14
3    15
4    16
5    18
dtype: int32


There's a lot more to Series. Consult the online documentation if you run into an unknown construction in some of the code you are studying!

### Creating DataFrames
DataFrames are two-dimensional arrays with row and column labels. You can construct them from most two-dimensional structures, such as nested lists, NumPy arrays and dicts.

In [11]:
# Construct a pandas dataframe from a list of lists
pd.DataFrame([[1,2], [3,4]])

   0  1
0  1  2
1  3  4


In [12]:
# Yet another variant
pd.DataFrame(np.random.rand(3, 4), columns=list('abcd'))

          a         b         c         d
0  0.887523  0.784710  0.214034  0.757618
1  0.419067  0.872147  0.988521  0.698696
2  0.987546  0.356123  0.062066  0.004783


In [22]:
# Construct the dataframe 
#      0  1
#   0  1  2
#   1  3  4
# from a NumPy array

## Your code ...
ar = np.array([[1, 2], [3, 4]])  # the row and column labels come from pandas, not from the array
print(ar)
pd.DataFrame(ar)

[[1 2]
 [3 4]]

   0  1
0  1  2
1  3  4


In [55]:
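# Construct a dataframe from 2 or more Series; Series should be equal sized
# Note: a dict of Series (as used below) turns each Series into a column;
# a list of Series, pd.DataFrame([s1, s2]), turns each Series into a row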

## Your code ...
s1 = pd.Series(range(10, 20))
s2 = pd.Series(range(0, 10))
df = pd.DataFrame({'1':s1, '2':s2})
print(df)

    1  2
0  10  0
1  11  1
2  12  2
3  13  3
4  14  4
5  15  5
6  16  6
7  17  7
8  18  8
9  19  9


In [73]:
# Create the dataframe
#           y     x
#     2014  1   aap
#     2015  2  noot
# from the Python dict data. 
data = {'y': [1,2], 'x':["aap", "noot"]}

## Your code ...
# Pass the index and the column order directly instead of relabelling afterwards
df = pd.DataFrame(data, index=[2014, 2015], columns=['y', 'x'])
print(df)

      y     x
2014  1   aap
2015  2  noot


In [3]:
# Create the dataframe 
#            Ajax  Feyenoord  PSV
# Ajax        NaN        1.0  3.0
# Feyenoord   1.0        NaN  1.0
# PSV         2.0        1.0  NaN
# from scratch. Randomize the numbers of points for each game.

## Your code ...
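import numpy as np
import pandas as pd

clubs = ['Ajax', 'Feyenoord', 'PSV']
# Random points (1, 2 or 3) for every pairing; randint excludes the upper bound
points = np.random.randint(1, 4, size=(3, 3)).astype(float)
np.fill_diagonal(points, np.nan)  # a club does not play against itself
df = pd.DataFrame(points, index=clubs, columns=clubs)
df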

In [None]:
# Can you find a random seed that makes PSV champion?
# ... or, quite challenging, can you create code to generate the final competition table (number of points for each club)?

## Your code ...


### Getting and Setting Data from and to DataFrames
The []-operator on DataFrames is heavily overloaded. It works as follows:
- If the parameter within [] is a single label, it identifies a single column and returns a Series object.
- If the parameter within [] is a slice, it identifies a set of rows and returns a DataFrame object.
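
The behaviour of `[]` can be sketched with the same small DataFrame used in the cells below. The third case, a list of labels, is not covered in the list above and is added here only for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=[2014, 2015], columns=['AMS', 'EHV', 'RTD'])

col = df['AMS']            # single label -> Series (one column)
rows = df[0:1]             # slice -> DataFrame (rows, selected by position)
cols = df[['AMS', 'RTD']]  # list of labels -> DataFrame (several columns)

print(type(col).__name__, type(rows).__name__, type(cols).__name__)
```

Because `[]` switches between columns and rows depending on its argument, `loc` and `iloc` are usually the clearer choice when both axes are involved.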

In [122]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2014, 2015], columns=['AMS', 'EHV', 'RTD'])
# Return the first column of df 

## Your code ...
print(df['AMS'])

2014    0
2015    3
Name: AMS, dtype: int32


In [126]:
# Do the same with iloc

## Your code ...
print(df.iloc[:, 0])

2014    0
2015    3
Name: AMS, dtype: int32


In [None]:
# Return the last row of df. 

## Your code ...


In [None]:
# Return the last row of DataFrame df, but now return a DataFrame

## Your code ...


In [None]:
# Return the data in the last row of df as a list

## Your code ...
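
One possible way to approach the three "last row" exercises above is sketched here; it is not necessarily the intended solution. The key point is that `iloc` with a scalar position returns a Series, while `iloc` with a list of positions returns a DataFrame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=[2014, 2015], columns=['AMS', 'EHV', 'RTD'])

last_as_series = df.iloc[-1]          # scalar position -> Series
last_as_frame = df.iloc[[-1]]         # list of positions -> DataFrame
last_as_list = df.iloc[-1].tolist()   # plain Python list of the row's values

print(last_as_list)
```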


### Adding Data to DataFrames

In [129]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2014,2015], columns=list("PQR"))
# Add column S with values [10,11]

## Your code ...
df['S'] = [10, 11]  # a plain list is aligned by position, so no explicit index is needed
print(df)

      P  Q  R   S
2014  0  1  2  10
2015  3  4  5  11


In [None]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2014,2015], columns=list("PQR"))
# Add column S with values [10,11]. Use loc (an indexer, so it takes square brackets).

## Your code ...
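
A minimal sketch of one way to do this, assuming the goal is simply a new column aligned by position: assigning through `loc` to a column label that does not exist yet enlarges the DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), index=[2014, 2015], columns=list("PQR"))

# ':' selects all rows; 'S' does not exist yet, so the column is created
df.loc[:, 'S'] = [10, 11]
print(df)
```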


In [None]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2015,2015], columns=list("PQR"))
# Append row [6,7,8] with label 2015 to dataframe

## Your code
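
A possible sketch for this exercise, taking the duplicate index `[2015, 2015]` exactly as given: because the label 2015 already exists, assigning through `df.loc[2015]` would target the existing rows rather than add a new one, so the example concatenates a one-row DataFrame instead. The name `new_row` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), index=[2015, 2015], columns=list("PQR"))

# Build the new row as its own one-row DataFrame with matching columns
new_row = pd.DataFrame([[6, 7, 8]], index=[2015], columns=df.columns)
df = pd.concat([df, new_row])
print(df)
```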


### Views versus Copies
View/copy semantics are largely inherited from NumPy. Read this [thorough explanation on SO](http://goo.gl/a43POn) to understand the pandas rules for views versus copies.

In [None]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2014,2015], columns=list("PQR"))
df1 = df[['P', 'R']]
df1.loc[2015, 'R'] = -1
df

In [None]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2014,2015], columns=list("PQR"))
df1 = df
df1.loc[2015, 'R'] = -1
df

In [None]:
df = pd.DataFrame(np.arange(6).reshape(2,3), index=[2014,2015], columns=list("PQR"))
df1 = df[:]
df1.loc[2015, 'R'] = -1
df

Can you explain the subtleties in the above 3 cells?
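
As a hint rather than a full answer, the sketch below covers the two cases whose outcome does not depend on the pandas version: selecting with a list of columns gives an independent object, while plain assignment only gives the same object a second name. The helper `fresh()` is introduced here for illustration, and the explicit `.copy()` is an addition that states the intent and avoids chained-assignment warnings. The `df[:]` case is deliberately left out, since its behaviour depends on the pandas version and on whether Copy-on-Write is enabled.

```python
import numpy as np
import pandas as pd

def fresh():
    """Return a new copy of the example DataFrame used in the cells above."""
    return pd.DataFrame(np.arange(6).reshape(2, 3), index=[2014, 2015], columns=list("PQR"))

# Case 1: a list of column labels selects into a new object with its own data
df = fresh()
df1 = df[['P', 'R']].copy()
df1.loc[2015, 'R'] = -1
unchanged = df.loc[2015, 'R']   # the original still holds its old value

# Case 2: plain assignment creates a second name for the very same object
df = fresh()
df1 = df
df1.loc[2015, 'R'] = -1
changed = df.loc[2015, 'R']     # the change is visible through both names

print(unchanged, changed, df1 is df)
```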