## Good Sites:
https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

**1.** Import pandas under the name pd.

In [2]:
import pandas as pd

**2.** Print out the version of pandas that has been imported.

In [3]:
print("Pandas Version: " + pd.__version__)

Pandas Version: 0.23.0


**3.** Print out the version information of the libraries that pandas requires.

In [4]:
import numpy as np
import sys
print("Numpy Version: " + np.__version__)
print("Python Version: " + sys.version)

Numpy Version: 1.14.3
Python Version: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
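
pandas can also report these versions in a single call. The following is a minimal sketch, assuming only that `pd.show_versions()` is available in the installed pandas; the set of libraries it lists varies with the pandas version and the environment, so no particular output is assumed here.

```python
import contextlib
import io

import pandas as pd

# Capture the report in a string so it can be searched or logged;
# calling pd.show_versions() on its own simply prints it.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    pd.show_versions()

report = buffer.getvalue()
print(report)
```

Capturing the text is optional; in a notebook cell, calling `pd.show_versions()` directly is usually enough.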


# DataFrame basics
### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames
Difficulty: easy

Note: remember to import numpy using:

    import numpy as np

Consider the following Python dictionary data and Python list labels:

    data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
            'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
            'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
            'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

    labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**4.** Create a DataFrame df from this dictionary data which has the index labels.

In [5]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data=data, index=labels)
df

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


**5.** Display a summary of the basic information about this DataFrame and its data.

In [6]:
df.describe()

Unnamed: 0,age,visits
count,8.0,10.0
mean,3.4375,1.9
std,2.007797,0.875595
min,0.5,1.0
25%,2.375,1.0
50%,3.0,2.0
75%,4.625,2.75
max,7.0,3.0
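
`describe()` only summarises the numeric columns. As a complementary sketch, `df.info()` lists every column with its dtype and non-null count, and `isnull().sum()` makes the missing values explicit. The variable names `summary` and `missing` are just illustrative.

```python
import io

import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

# info() prints to stdout by default; passing buf captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Missing values per column, which describe() only implies through 'count'.
missing = df.isnull().sum()
print(missing)
```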


**6.** Return the first 3 rows of the DataFrame df.

In [7]:
df.head(3)

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no


**7.** Select just the 'animal' and 'age' columns from the DataFrame df.

In [8]:
print(df.loc[:, ['animal','age']])
print(df.loc[:, 'animal':'age'])
print(df[df.columns[0:2]])

  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0


**8.** Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].

In [9]:
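# Rows 3, 4 and 8 are counted from 1 here, i.e. positions 2, 3 and 7.
# The slice 0:2 covers both 'animal' and 'age' (0:1 would return 'animal' only).
df.iloc[[2, 3, 7], 0:2]

Unnamed: 0,animal,age
c,snake,0.5
d,dog,
h,cat,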
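
The selection above picks positions 2, 3 and 7. If the row numbers in the exercise are instead read as zero-based positions, one possible approach is to translate positions into labels with `df.index[...]` and then select the columns by name with `.loc`. This is a sketch of that reading, not necessarily the intended answer.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

# df.index[[3, 4, 8]] converts zero-based positions into labels,
# so .loc can combine them with column names.
subset = df.loc[df.index[[3, 4, 8]], ['animal', 'age']]
print(subset)
```

Selecting columns by name avoids depending on the column order, which a positional slice such as `0:2` does.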


**9.** Select only the rows where the number of visits is greater than 3.

In [10]:
df[df['visits'] > 3]

Unnamed: 0,animal,age,visits,priority


**10.** Select the rows where the age is missing, i.e. is NaN.

In [11]:
df[df['age'].isnull()]

Unnamed: 0,animal,age,visits,priority
d,dog,,3,yes
h,cat,,1,yes


**10.1** Count the number of animals of each type.

In [12]:
df['animal'].value_counts()

cat      4
dog      4
snake    2
Name: animal, dtype: int64

**Funky ones**

1. Calling .values on an aggregate such as mean() returns a NumPy array instead of a Series.

2. reshape(1, -1) turns that rank-1 array into a two-dimensional array with a single row.

3. Slicing with [:-1] keeps everything but the last element (here, the mean of the last column).

4. A -1 passed to reshape tells NumPy to infer that dimension from the length of the original array.


In [13]:
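df1 = pd.DataFrame({'a': [1, 3, 5, 3, 10], 'b': [3, 8, 9, 10, 22], 'c': [5, 3, 2, 3, 5]})
dfSeries = df1.mean()[:-1]                   # Series of column means without the last entry ('c')
df1Array = df1.mean().values.reshape(1, -1)  # NumPy array; -1 lets NumPy infer the 3 columns
df2Array = df1.mean().values.reshape(1, 3)   # same shape, stated explicitly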

print(dfSeries)
print(df1Array)
print(df1Array.shape)
print(df2Array.shape)

a     4.4
b    10.4
dtype: float64
[[ 4.4 10.4  3.6]]
(1, 3)
(1, 3)
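
The cell above shows `reshape(1, -1)` but not the flattening case from note 4. A small sketch of the remaining shapes, reusing the same `df1`; the names `row`, `col` and `flat` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 3, 5, 3, 10], 'b': [3, 8, 9, 10, 22], 'c': [5, 3, 2, 3, 5]})

means = df1.mean()                   # Series indexed by column name
row = means.values.reshape(1, -1)    # 2-D array with one row; -1 infers 3 columns
col = means.values.reshape(-1, 1)    # 2-D array with one column; -1 infers 3 rows
flat = row.reshape(-1)               # back to a 1-D array; -1 infers length 3

print(type(means).__name__, means.shape)
print(isinstance(means.values, np.ndarray))
print(row.shape, col.shape, flat.shape)
```

The `(1, -1)` form is the one typically needed when a library expects a single sample as a two-dimensional array.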


**11.** Select the rows where the animal is a cat and the age is less than 3.

In [19]:
df[(df['animal'] == 'cat') & (df['age'] < 3)]

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
f,cat,2.0,3,no


**12.** Select the rows where the age is between 2 and 4 (inclusive).

In [20]:
# between() includes both endpoints by default
df[df['age'].between(2, 4)]

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
f,cat,2.0,3,no
j,dog,3.0,1,no
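
The boolean form of the `inclusive` argument matches the pandas 0.23 used in this notebook; newer pandas releases expect a string such as `'both'` or `'neither'` instead. A version-independent sketch is to write the comparison out explicitly, which also makes exclusive bounds easy. The mask names below are illustrative.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

age = df['age']
inclusive_mask = (age >= 2) & (age <= 4)   # same rows as age.between(2, 4)
exclusive_mask = (age > 2) & (age < 4)     # both endpoints excluded

print(df[inclusive_mask])
print(df[exclusive_mask])
```

Rows with a missing age fail both comparisons, so they are excluded in either case.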


**13.** Change the age in row f to 1.5 

In [25]:
df.loc['f','age'] = 1.5
df

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,1.5,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


**14.** Calculate the sum of all visits (the total number of visits).

In [28]:
df['visits'].sum()

19

**15.** Calculate the mean age for each different animal in df.

In [30]:
df.groupby('animal')['age'].mean()

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64
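
When more than one statistic per group is needed, `agg` accepts a list of function names. A brief sketch, assuming the same data with row f already changed to 1.5 as in step 13; note that `count` ignores missing ages.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df.loc['f', 'age'] = 1.5   # same change as step 13

# One row per animal, one column per aggregate.
summary = df.groupby('animal')['age'].agg(['mean', 'count'])
print(summary)
```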

**16.** Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.

In [136]:
# Approach 1: add a single row by label
# df.loc['k'] = ['rabbit', 3, 4, 'yes']

# Approach 2: append the row, then restore the labels, because
# ignore_index=True resets the index to 0..n-1.
# Note: DataFrame.append was removed in pandas 2.0; use pd.concat there.
idx = list(df.index) + ['k']

df = df.append({'animal': 'python', 'age': 3, 'visits': 4, 'priority': 'no'}, ignore_index=True)
df.index = idx
df


**16 b** Drop the new row to return to the original DataFrame.

In [138]:
df.drop(['k'], inplace=True)
df

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1.0,yes
b,cat,3.0,3.0,yes
c,snake,0.5,2.0,no
d,dog,,3.0,yes
e,dog,5.0,2.0,no
f,cat,1.5,3.0,no
g,snake,4.5,1.0,no
h,cat,,1.0,yes
i,dog,7.0,2.0,no
j,dog,3.0,1.0,no
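
`DataFrame.append` works on the pandas 0.23 used here but was removed in pandas 2.0. On newer versions, one possible equivalent is `pd.concat` with a one-row DataFrame that already carries the new label. A sketch under that assumption; the names `new_row`, `extended` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

# Build the new row as its own DataFrame so that it keeps the label 'k'.
new_row = pd.DataFrame(
    [{'animal': 'python', 'age': 3, 'visits': 4, 'priority': 'no'}],
    index=['k'])

extended = pd.concat([df, new_row])   # original rows plus 'k'
restored = extended.drop('k')         # drop returns a new DataFrame by default

print(extended.tail(2))
print(restored.index.tolist())
```

Because the label travels with the row, there is no need to rebuild the index afterwards.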


**Tidbits:**

Setting the index of a DataFrame:

1. To use an external array or list as the index, assign it directly: df.index = labels. Its length must match the number of rows.
2. To use one of the columns as the index, call df.set_index('column'), which returns a new DataFrame unless inplace=True is passed.
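
The two options can be sketched on a small, made-up frame (the three-row data below is hypothetical and only serves the illustration):

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'snake'], 'visits': [1, 3, 2]})

# 1. External labels: assign a list whose length equals the number of rows.
df.index = ['a', 'b', 'c']

# 2. Existing column: set_index returns a new DataFrame and moves the
#    column out of the data and into the index; df itself is unchanged.
by_animal = df.set_index('animal')

print(df)
print(by_animal)
```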

