# NumPy

We start with NumPy because many other AI libraries are built on it, and it is also useful if you want to create dummy datasets.

First, import NumPy with its common alias **np**:

In [None]:
import numpy as np

## NumPy array

* An array can be created by passing an object, such as a list or tuple, to `np.array(object)`
* ndarray - N-dimensional array
    * Quicker than Python collections for numerical work
    * All elements must have the same data type
        * Pass `dtype` when creating an array to set the type
        * Use `.astype()` after creation to get a copy of the array with a different data type
    * Arrays have a **shape** and a **size** attribute
        * You can change the shape, but the new shape must hold the same number of elements (the size)
* We often need to transpose a matrix before multiplication, using `.T`

In [None]:
ages_list = [20, 49, 17, 56, 18, 70]
ages_arr = np.array(ages_list)
print(ages_arr)

In [None]:
ages = np.array((20, 49, 17, 56, 18, 70))
print(ages)
print(type(ages))

In [None]:
print('Size:  ', ages.size)
print('Shape:  ', ages.shape)

In [None]:
ages.reshape(3, 2)  # returns a reshaped array; ages itself is not changed

In [None]:
ages
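
The two cells above show that `.reshape()` hands back a reshaped array and leaves `ages` as it was. As a minimal sketch of the "same size" rule (the variable names `reshaped`, `inferred` and `error_message` are ours, purely for illustration), we can also let NumPy infer one dimension with `-1` and see what happens when the sizes do not match:

```python
import numpy as np

ages = np.array([20, 49, 17, 56, 18, 70])

reshaped = ages.reshape(3, 2)      # reshaped result with shape (3, 2)
print(ages.shape, reshaped.shape)  # ages itself is still one-dimensional

inferred = ages.reshape(2, -1)     # -1 asks NumPy to work out the missing dimension
print(inferred.shape)

error_message = None
try:
    ages.reshape(4, 2)             # 4 * 2 = 8 elements, but the array only has 6
except ValueError as err:
    error_message = str(err)
print(error_message)
```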

In [None]:
# indexing and slicing work the same way as for Python lists
print('first', ages[0])
print('last', ages[-1])
print('slice', ages[2:5])

In [None]:
ages.shape = (6, 1)  # 2-dimensional
ages 

In [None]:
ages.reshape(3, 2, 1)  # 3-dimensional, etc.

In [None]:
ages.shape = (3,2)
ages

In [None]:
ages.T
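
To see why the transpose matters for multiplication, here is a small sketch using the `@` matrix-multiplication operator. The matrices `a` and `b` are made-up values, not part of the notebook's data:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
b = np.array([[1, 0, 1],
              [0, 1, 0]])   # shape (2, 3)

# a @ b would fail because the inner dimensions (3 and 2) do not match.
# Transposing b gives shape (3, 2), so (2, 3) @ (3, 2) produces a (2, 2) result.
product = a @ b.T
print(product)
```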

In [None]:
ages.shape = (6,)
print(ages)
print(ages.dtype)

In [None]:
ages = np.array((20, 49, 17, 56, 18, 70), dtype='str')
print(ages)
print(ages.dtype)

In [None]:
ages.astype('float')

## Further array creation

Rather than passing in an existing object, we can ask NumPy to create arrays for us. This is useful for plotting graphs, creating dummy datasets, etc. Below are some of the functions available:

* `np.arange(start, stop, step)` - a range of numbers from start up to, but not including, stop
* `np.linspace(start, stop, N)` - N evenly spaced numbers between start and stop (inclusive)
* `np.zeros(N)` - array of N zeros
* `np.eye(N)` - N x N identity matrix
* `np.random.choice(a, N)` - N random draws from the values in `a`
* `np.random.normal(mean, standard_deviation, N)` - N random samples from a normal distribution

In [None]:
pi_range = np.arange(-np.pi, 2*np.pi, 0.5)
pi_range

In [None]:
spaced = np.linspace(0, 20, 7)
spaced
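
The difference between the two functions is easiest to see side by side. In this small sketch (the values are chosen only for illustration), `np.arange` stops before the end value while `np.linspace` includes it:

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)   # step of 0.25, stop value excluded
print(stepped)

spaced_5 = np.linspace(0, 1, 5)   # exactly 5 points, stop value included
print(spaced_5)
```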

In [None]:
zeros = np.zeros(5)
zeros

In [None]:
I = np.eye(5)
I

In [None]:
coin = np.random.choice(['H','T'], 10) #Fair coin
coin

In [None]:
unfair_coin = np.random.choice(['H', 'H', 'T'], 10) # Unfair coin: 'H' is twice as likely as 'T'
unfair_coin
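
Repeating a value is one way to bias the draw. Another option, sketched below, is the `p` argument of `np.random.choice`, which takes one probability per value. The seed and the name `biased_coin` are arbitrary choices for this example:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the draw is repeatable

# p gives the probability of each value, in order; the probabilities must sum to 1
biased_coin = np.random.choice(['H', 'T'], size=10, p=[2/3, 1/3])
print(biased_coin)
```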

In [None]:
samp = np.random.normal(10, 3, 10)
samp

## Operations

* Operations are performed element-wise on arrays
* Statistical operations are easy to calculate

In [None]:
even_list = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
even_list * 2  # on a plain list, * repeats the list instead of doubling each element

In [None]:
evens = np.array(even_list)
print('Array', evens)
print('Addition:\n', evens + 5)
print('Subtraction:\n', evens - 5)
print('Multiplication:\n', evens * 2)
print('Division:\n', evens / 2)
print('Modulus:\n', evens % 5) # Remainder
print('Less than 10:\n', evens < 10)

# can combine operations
print('Multiples of 5:\n', evens % 5 == 0)
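
The boolean arrays produced by these comparisons can also be used to pick out elements. A short sketch, rebuilding the same even numbers with `np.arange` so the block runs on its own:

```python
import numpy as np

evens = np.arange(0, 21, 2)   # 0, 2, 4, ..., 20

# A boolean array inside the square brackets keeps only the True positions
multiples_of_5 = evens[evens % 5 == 0]
print(multiples_of_5)

# Combine conditions with & (and) or | (or); each condition needs its own brackets
middle = evens[(evens > 5) & (evens < 15)]
print(middle)
```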


In [None]:
print(pi_range, '\n')
print('min:', np.min(pi_range).round(2))
print('max:', np.max(pi_range))
print('sum:', np.sum(pi_range))
print('mean:', np.mean(pi_range))
print('median:', np.median(pi_range))
print('var:', np.var(pi_range))
print('std:', np.std(pi_range))
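
One detail worth knowing: `np.var` and `np.std` divide by N by default (population statistics), whereas the pandas equivalents we meet next divide by N - 1 by default (sample statistics). The `ddof` argument switches between the two. The `data` values below are made up for this sketch:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_var = np.var(data)       # divides by N (ddof=0 is the default)
sample_var = np.var(data, ddof=1)   # divides by N - 1

print('population variance:', population_var)
print('sample variance:', sample_var)
print('population std:', np.std(data))
```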

# Pandas

* Key module for handling datasets
* Good support for time series data
* Import pandas with the common alias **pd**

In [None]:
import pandas as pd

In [None]:
np.random.seed(2) # make the random numbers predictable
salary = np.random.normal(50, 20, 50)
salary

## Data Series

* Generalised list with advanced indexing
* A Series has an **index** and **values**
    * You can use custom indexing
* Element-wise operations (as above)
* Filtering - if we want to remove data or create another Series
    * Filter with a boolean condition to create a new Series, which may have a different size
    * `.where()` - keep values that meet the condition and replace the rest with NaN. Preserves size
    * `.mask()` - replace values that meet the condition with NaN and keep the rest. Preserves size
    * NaN - Not a Number - marks a place where a value is expected but missing

In [None]:
ds_salary = pd.Series(salary.round(2))
ds_salary

In [None]:
print('index:', ds_salary.index)
print('values:\n', ds_salary.values)

In [None]:
print('first', ds_salary[0])
print('slice\n', ds_salary[10:20])
# ds_salary[-1] raises a KeyError here because -1 is not a label in the index; use ds_salary.iloc[-1] for the last value
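
For access by position, including negative positions, a Series offers `.iloc`, while `.loc` looks values up by label. A small sketch with made-up numbers (the names `s` and `labelled` are only for this example):

```python
import pandas as pd

s = pd.Series([41.5, 48.9, 7.3, 82.8])

last = s.iloc[-1]        # position-based, so negative positions work
first_two = s.iloc[:2]   # positions 0 and 1
print(last)
print(first_two)

# With custom labels, .loc looks up by label and .iloc still works by position
labelled = pd.Series([41.5, 48.9], index=['Salary 1', 'Salary 2'])
print(labelled.loc['Salary 2'], labelled.iloc[0])
```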

In [None]:
ds_salary * 1000

In [None]:
ds_salary < 20

In [None]:
low_sal = ds_salary[ds_salary < 20]
low_sal

In [None]:
ds_where = ds_salary.where(ds_salary < 20)
ds_where

In [None]:
ds_mask = ds_salary.mask(ds_salary<20)
ds_mask
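
`.where()` and `.mask()` are mirror images of each other, which is easier to see on a tiny Series. The values here are made up. Both methods also accept a replacement value (the `other` argument) if NaN is not what you want:

```python
import pandas as pd

s = pd.Series([10, 25, 15, 40])

kept = s.where(s < 20)             # values >= 20 become NaN
masked = s.mask(s < 20)            # values < 20 become NaN
filled = s.where(s < 20, other=0)  # replace with 0 instead of NaN

print(kept)
print(masked)
print(filled)
```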

In [None]:
salaries = ['Salary ' + str(i) for i in range(1, 51)]
ds_salary.index = salaries

In [None]:
ds_salary

In [None]:
ds_salary.value_counts() # count of all unique values

In [None]:
ds_salary.sort_values()

# Data Frame

* DataFrames are tables
* They are a collection of Series that share an index
* Seaborn has sample datasets which we can load as DataFrames
   *  **NOTE** - we will look at seaborn in more detail later
* `.describe()` - useful method to get summary statistics from a DataFrame
* Easy data manipulation and handling

In [None]:
df_sal = pd.DataFrame({
    'Salary No': salaries,
    'Salary': salary
})
df_sal.head() #by default will show the first 5 rows

In [None]:
df_sal.describe()

In [None]:
import seaborn as sns

In [None]:
titanic = sns.load_dataset('titanic')
titanic.head()

In [None]:
titanic.info()

## Missing values

* You can easily check for missing values
* Most models cannot be trained on data containing NaN, so we have to deal with missing values first
* Handling NaN:
    * Remove them from the sample
    * Replace them with the mean value for that column
    * Use ML to predict the values
    * Encode them as -1 or -9999, or as a separate categorical value
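
As a rough sketch of the first two options plus the categorical encoding, here is a tiny made-up DataFrame (the column names echo the titanic data, but the values are invented). `dropna()` removes rows and `fillna()` replaces the gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan, 58.0],
    'deck': ['C', None, 'E', 'B', None],
})

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill numeric gaps with the column mean,
# and give missing categories their own label
df_filled = df.copy()
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].mean())
df_filled['deck'] = df_filled['deck'].fillna('Unknown')

print(dropped)
print(df_filled)
```

Dropping is simplest but can throw away a lot of rows, so it is worth checking how many remain before deciding.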

In [None]:
titanic.isnull().sum() # check for missing values

In [None]:
titanic.dropna(inplace=True)

In [None]:
titanic.isnull().sum()

In [None]:
titanic

# Searching data

* `.loc[row, column]` - access using labels
* `.iloc[row, column]` - access using integer positions

In [None]:
titanic.loc[:,'age'] #all rows, 'age' column

In [None]:
titanic.loc[:20, 'age'] # labels up to and including 20, for the 'age' column. NOTE: some labels are missing because we dropped the rows with NaN values

In [None]:
titanic.loc[:50, ['age', 'pclass']]

In [None]:
titanic.loc[3, :] # row with index label 3 (not necessarily the 3rd row), all columns

In [None]:
titanic.iloc[:20, :5] #First 20 rows and first 5 columns
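
Labels and positions only differ once the index has gaps, as ours does after dropping rows. A minimal sketch with a made-up DataFrame whose index skips numbers:

```python
import pandas as pd

df = pd.DataFrame({'age': [38.0, 35.0, 54.0, 4.0]}, index=[1, 3, 6, 10])

by_label = df.loc[3, 'age']     # the row whose label is 3
by_position = df.iloc[3, 0]     # the 4th row, 1st column

label_slice = df.loc[:6]        # label slices include the end label
position_slice = df.iloc[:2]    # position slices exclude the end position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```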

In [None]:
titanic[10:20] # rows at positions 10 to 19

# Adding columns

In [None]:
# Say we wanted to guess the salary of the person purchasing the ticket
titanic['Salary'] = (titanic.loc[:,'fare'] * 1.8)*100 /titanic.loc[:, 'pclass']
titanic.head()

In [None]:
# Children won't have a salary
titanic.loc[titanic['who']=='child', 'Salary'] = 0
titanic

In [None]:
titanic.describe()

In [None]:
titanic.loc[:, ['pclass', 'age', 'fare', 'Salary']].head(20)

In [None]:
titanic[titanic['pclass'] == 3]

In [None]:
titanic.columns

In [None]:
titanic.rename({'Salary':'assumed_pay'}, axis=1, inplace=True)

In [None]:
titanic.head()

In [None]:
titanic.drop('assumed_pay', axis=1).head()

In [None]:
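titanic.groupby('who').mean(numeric_only=True) # average only the numeric columns; recent pandas versions raise an error otherwise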

In [None]:
titanic.groupby('who').get_group('woman').head()
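
Often we only want the statistics for one column. Selecting the column after `groupby` and using `.agg()` gives several summaries at once. The data below is made up so that the block runs without downloading the titanic dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'who': ['man', 'woman', 'child', 'woman', 'man'],
    'fare': [10.0, 30.0, 20.0, 50.0, 14.0],
})

# Mean fare per group
mean_fare = df.groupby('who')['fare'].mean()
print(mean_fare)

# Several aggregations at once: one column per statistic
summary = df.groupby('who')['fare'].agg(['mean', 'count'])
print(summary)
```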

# Merge and join

In [None]:
df_sal

In [None]:
yrs_employed = pd.DataFrame({
    'Salary No': ['Salary 1', 'Salary 5', 'Salary 41', 'Salary 35', 'Salary 19'],
    'Years Employed': [10, 6, 19, 7, 28]
})

yrs_employed

In [None]:
df_sal.merge(yrs_employed, on='Salary No')
# Defaults to an inner join: only rows whose 'Salary No' appears in both tables are kept

In [None]:
df_sal.merge(yrs_employed, on='Salary No', how='outer') # outer join keeps every row from both tables
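
The join types are easier to compare on two tiny tables. In this sketch the frames `left` and `right` and their values are made up. `indicator=True` adds a `_merge` column showing which table each row came from:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'salary': [30, 40, 50]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'years': [2, 5, 9]})

inner = left.merge(right, on='key')                  # keys in both: b, c
left_join = left.merge(right, on='key', how='left')  # every key from left: a, b, c
outer = left.merge(right, on='key', how='outer', indicator=True)  # all keys: a, b, c, d

print(inner)
print(left_join)
print(outer)
```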

In [None]:
yrs_employed['Years Employed']

In [None]:
yrs = yrs_employed['Years Employed']
yrs.index = yrs_employed['Salary No']
yrs

In [None]:
df_sal.join(yrs, on='Salary No')

# Exercise

* Read a dataset into a DataFrame
* Investigate the data and carry out any preparation
    * Summary stats
    * Column data
    * Missing Values
    * Add new columns
    * Rename column names
    * etc.
    