### Motivation
Taking the course last year, I noticed quite a bit of avoidable frustration when using some of the libraries.

ML is fun and you don't need a lot to get started. I think the tools below should be more than plenty.
Also, numpy, sklearn, keras, pytorch, pandas etc. are some of the best documented libraries out there.
Remember to ask questions in the forums if something doesn't work. Many others likely have the same problem and will be grateful that someone had the courage to ask about it.




# 🔨 Survival Toolkit

### 🐼  Pandas
We're going to use the Titanic dataset. If you run this notebook on Kaggle, it's already imported; otherwise change the path below as described in the comments.

In [None]:
# import pandas and give it a shorter name 
import pandas as pd

### Importing CSV data
You can import many other file types such as parquet, JSON or Excel with read_parquet, read_json and read_excel, and they all work analogously.
In the course you'll mostly use csv.
You can download the csv file here https://www.kaggle.com/rahulsah06/titanic?select=train.csv if you don't want to sign up for kaggle :-(

In [None]:
#if you have data on your local machine or elsewhere change the path
path = '../input/titanic/train.csv'
#df is short for dataframe
df = pd.read_csv(path)

In [None]:
#ok, let's see what we have here
df

In [None]:
#you can print the top n rows with
n = 3
df.head(n)

In [None]:
#similarly the bottom n
df.tail(n)

In [None]:
#or take a random sample of size n
df.sample(n)

I didn't pass any arguments to read_csv(), but it actually has a lot of useful ones.
One is especially useful to know, as it led to one of the frustrating errors I mentioned earlier.

**index_col**

If you don't pass an argument, pandas will just enumerate the rows.
But sometimes you do have an index column! It makes no sense to use it as data:
it's not a rank, just an arbitrary enumeration.


In [None]:
#let's have a look again
df.head(5)

### Notice something about PassengerId?
Indeed, it's an index column, so we either drop it or use it directly as the index by specifying it by int (position) or string (name of col).

In [None]:
df = pd.read_csv(path, index_col = 0)
#this is equivalent
#df = pd.read_csv(path, index_col = 'PassengerId')
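
If you want to see the effect of index_col without the Titanic file, here is a minimal sketch on a made-up three-row CSV (the data and the names csv_text, plain and indexed are just for illustration):

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv
csv_text = "PassengerId,Survived,Name\n1,0,Ann\n2,1,Bob\n3,1,Cy\n"

plain = pd.read_csv(io.StringIO(csv_text))                 # pandas enumerates the rows 0..2
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index

print(plain.columns.tolist())    # PassengerId is still a data column
print(indexed.columns.tolist())  # PassengerId is no longer among the columns
print(indexed.index.tolist())    # because it now provides the row labels
```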

In [None]:
df.head(n)

In [None]:
#and if you want the enumerated index back
df.reset_index(inplace = True) 
#without inplace it returns a new dataframe and leaves df unchanged
#this is actually true for a lot of functions in pandas

In [None]:
df.head(3)

In [None]:
#like drop
#you can drop rows axis=0, or columns axis=1, many or just one like this
df.drop('PassengerId', axis=1).head(3)

In [None]:
#make it permanent
df.drop('PassengerId', axis=1, inplace = True)
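
To convince yourself what happens without inplace, here is a small sketch on a made-up frame (toy and dropped are hypothetical names). Reassigning the result is the usual alternative to inplace = True:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical toy data

dropped = toy.drop("a", axis=1)   # returns a new DataFrame
print(toy.columns.tolist())       # toy still has both columns
print(dropped.columns.tolist())   # only the returned object lost "a"

toy = toy.drop("a", axis=1)       # reassign to make the change stick
print(toy.columns.tolist())
```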

### Loc and iLoc

In [None]:
#there are many different ways to access a subset of rows and columns, the most straightforward are loc and iloc
#think of iloc as numpy-esque (integer positions) and of loc as just using names (labels)
df = pd.read_csv(path)
df.loc[2, "Survived"]

In [None]:
df.loc[2, ["Survived", "Name", "Sex"]]
#this btw outputs a pandas series, which sometimes leads to bugs
#you can fix that by wrapping 2 in a list [2] and you'll get a dataframe, try it
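
The series vs. dataframe difference is easy to check on a made-up frame (toy, one_row and as_frame are hypothetical names used only in this sketch):

```python
import pandas as pd

toy = pd.DataFrame({"Survived": [0, 1, 1], "Name": ["Ann", "Bob", "Cy"]})

one_row = toy.loc[2, ["Survived", "Name"]]      # scalar row label -> Series
as_frame = toy.loc[[2], ["Survived", "Name"]]   # list of row labels -> DataFrame

print(type(one_row).__name__, one_row.shape)
print(type(as_frame).__name__, as_frame.shape)
```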

In [None]:
df.loc[[2, 4, 6], ["Survived", "Name", "Sex"]]

In [None]:
#with iloc
df.iloc[[2, 4, 6], [0, 2, 3]]

In [None]:
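#hmm that does not seem right, why?
#hint: iloc counts positions starting at 0, so check which columns 0, 2 and 3 really are
#fix it by replacing the question marks and uncommenting the line
#df.iloc[[2, 4, 6], [?, ?, ?]]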
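
As a reminder of the difference, here is a minimal sketch on a made-up frame whose row labels start at 1 (the frame and the names by_label and by_position are assumptions for illustration): loc looks things up by label, iloc by position.

```python
import pandas as pd

# Hypothetical toy frame whose row labels start at 1, like PassengerId
toy = pd.DataFrame(
    {"Survived": [0, 1, 1], "Name": ["Ann", "Bob", "Cy"]},
    index=[1, 2, 3],
)

by_label = toy.loc[2, "Name"]    # row labelled 2, column named "Name"
by_position = toy.iloc[2, 1]     # third row, second column

print(by_label)      # Bob
print(by_position)   # Cy
```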

In [None]:
#the question marks reminded me of a cool jupyter feature
? df.iloc

If you ever want to know what a function does or read its docs, just do ? foo

In [None]:
df.reset_index()

### Masking

In [None]:
#you can define a boolean mask
mask = df.Survived == 1
mask = df['Survived'] == 1
#these are equivalent

In [None]:
#let's only look at survivors
df[mask]

In [None]:
#you can create more elaborate masks
mask2 = df.Name.str.contains('Adele')

In [None]:
df[mask2]

In [None]:
#chain them with & (and) or | (or) 
df[mask & mask2]

In [None]:
#you also don't have to save them in a variable
df[(df.Survived == 1)].head(3)
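
One thing that trips people up when writing masks inline: each comparison needs its own parentheses. A small sketch on made-up data (toy, adult_survivors and not_survivors are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"Survived": [1, 0, 1], "Age": [30, 40, 8]})

# & and | bind tighter than == and >=, so wrap every comparison in parentheses
adult_survivors = toy[(toy.Survived == 1) & (toy.Age >= 18)]
not_survivors = toy[~(toy.Survived == 1)]   # ~ negates a mask

print(adult_survivors)
print(not_survivors)
```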

### Apply

In [None]:
#you can apply functions to rows and columns
#axis=0 passes each column to the function (it runs down the rows)
#axis=1 passes each row to the function (it runs left to right across the columns)
df[['Fare']].apply('mean', axis=0) #pandas accepts string shortcuts for stat functions like sum, median, mean, etc.

In [None]:
#you can write your own function and use it the same way
def foo(x):
    return "I love ML"

df.apply(foo, axis=1)
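
The axis argument is easier to remember with numbers you can check by hand. A minimal sketch on a made-up frame (toy, col_sums and row_sums are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = toy.apply("sum", axis=0)                             # one result per column
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)   # one result per row

print(col_sums.tolist())   # [6, 60]
print(row_sums.tolist())   # [11, 22, 33]
```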

In [None]:
#useful stats
df.describe()

In [None]:
df.median(numeric_only=True) #skip text columns like Name, recent pandas versions raise an error otherwise
#etc.

### One Hot and label encoding directly in pandas
For the course projects you can just use pd.get_dummies for one-hot encoding or pd.factorize for label encoding, because you have access to the full dataset.
In reality or in Kaggle competitions you'll want to use sklearn's LabelEncoder, OneHotEncoder etc., because not all categories might be present in future data.
An additional warning: redundancy is poison. If we one-hot encode Sex you'll get two columns, female and male, but in most cases male == 1 already implies female == 0.
This can lead to overfitting and make your matrix non-invertible. It likely won't matter for passing the tests, but if you want that extra 0.005 on the leaderboard, label encoding (factorize) etc. is better suited. If you like, read up at https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features :-).

In [None]:
#one hot
pd.get_dummies(df.Sex).head(3)

In [None]:
encoding, index = df.Sex.factorize()
df["F/M"] = encoding
df[["F/M", "Sex"]].head(3)
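
If you do want one-hot columns without the redundancy, get_dummies has a drop_first argument. A small sketch on a made-up Sex column (sex, full, reduced, codes and categories are hypothetical names):

```python
import pandas as pd

sex = pd.Series(["male", "female", "female"], name="Sex")

full = pd.get_dummies(sex)                       # two columns: female, male
reduced = pd.get_dummies(sex, drop_first=True)   # drops the first category, keeps male

codes, categories = pd.factorize(sex)            # one integer label per category

print(full.columns.tolist(), reduced.columns.tolist())
print(codes.tolist(), list(categories))
```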

### Saving csv files
This seemed to cause quite some errors and frustration in the projects.
Remember how we took care of the index?
Well, you can also take care of the header, not only when reading but also when saving.
Since the projects expect a csv file in a given format, make sure you don't add things like a header or the index that pandas automatically added when reading in.

In [None]:
#it would be pretty straightforward to do:
df.to_csv('Submission.csv')

In [None]:
#but look what happens
pd.read_csv("Submission.csv")

In [None]:
#pandas saved the added index column as Unnamed: 0
#if you try to submit this you'll get an error (in kaggle too btw)
#instead do 
df.to_csv("Submission.csv", index = False)

In [None]:
#tada
pd.read_csv("Submission.csv")

In [None]:
#sometimes you may also not want a header, e.g. if the submission expects ONLY the results
df.to_csv("Submission.csv", index = False, header = False)

In [None]:
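#tell pandas there is no header, otherwise the first data row becomes the column names
pd.read_csv("Submission.csv", header=None)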
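
You can check all three variants without writing a file, since to_csv returns the CSV text when you don't give it a path. A minimal sketch with a made-up submission frame (preds and the other names are assumptions for illustration):

```python
import io
import pandas as pd

preds = pd.DataFrame({"Id": [1, 2], "Prediction": [0.5, 1.5]})  # hypothetical submission

with_index = preds.to_csv()                       # index is written as an unnamed first column
clean = preds.to_csv(index=False)                 # header only, no index
bare = preds.to_csv(index=False, header=False)    # neither header nor index

print(with_index.splitlines()[0])   # starts with an empty field for the index
print(clean.splitlines()[0])

# A file without a header has to be read back with header=None
restored = pd.read_csv(io.StringIO(bare), header=None)
print(restored.shape)
```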

# Numpy

In [None]:
import numpy as np

In [None]:
#making Sex a numeric feature
df["Sex"] = df["Sex"].apply(lambda x: 1 if x == "female" else -1)

In [None]:
#you can convert pandas dataframes or series directly to numpy
A = df[["Pclass", "Survived", "Sex"]].to_numpy()
y = df["Fare"].to_numpy()

In [None]:
#Transpose of A
A.T

In [None]:
#Matrix multiplication in python via @
A.T @ A

In [None]:
#Ralf would cringe:
x = np.linalg.inv(A.T @ A) @ (A.T @ y)

In [None]:
x
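
If you'd rather not invert A.T @ A yourself, np.linalg.lstsq solves the same least-squares problem without forming the inverse explicitly. A minimal sketch on made-up data (A_toy, y_toy and x_true are hypothetical and not from the Titanic set):

```python
import numpy as np

rng = np.random.default_rng(0)
A_toy = rng.normal(size=(20, 3))          # hypothetical design matrix
x_true = np.array([1.0, -2.0, 0.5])       # hypothetical true coefficients
y_toy = A_toy @ x_true                    # noise-free targets

x_inv = np.linalg.inv(A_toy.T @ A_toy) @ (A_toy.T @ y_toy)   # normal equations, as above
x_lstsq, *_ = np.linalg.lstsq(A_toy, y_toy, rcond=None)      # least-squares solver

print(np.allclose(x_inv, x_lstsq))
```

On well-behaved data both give the same answer. The solver is generally the safer choice when A.T @ A is close to singular.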

In [None]:
#this is the easiest way to save your predictions, then just do to_csv()
df["Predictions"] = A @ x

In [None]:
df[["Fare", "Predictions"]]

Subtle differences between lists and arrays: + and * concatenate and repeat lists, but work elementwise on arrays

In [None]:
alist = [1,2,3]
anarray = np.array(alist)

print(alist + alist)
print(anarray + anarray)
print(alist * 2)
print(anarray * 2)

### Creating Matrices

In [None]:
n_rows = 3
n_cols = 5
zeros = [0.0]*n_cols
Z = np.zeros((n_rows, n_cols))
A = np.array([zeros, zeros, zeros])
B = np.zeros_like(A)
O = np.ones_like(A)
for m in [Z, A, B, O]:
    print(f"A Matrix \n {m} \n ")

In [None]:
Z

In [None]:
A

In [None]:
B

In [None]:
O

In [None]:
#reshape array to one column
A.reshape((-1, 1))

In [None]:
#make a 3d Matrix out of it
A.reshape((1, 3, 5))

In [None]:
#reshape to 5x3, careful: this is NOT the transpose (use A.T for that)
A.reshape((n_cols, n_rows))
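
With an all-zero matrix you can't see it, but reshape and transpose give different results. A minimal sketch with distinguishable entries (M_toy, reshaped and transposed are hypothetical names):

```python
import numpy as np

M_toy = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

reshaped = M_toy.reshape((3, 2))       # refills row by row: [[0, 1], [2, 3], [4, 5]]
transposed = M_toy.T                   # swaps the axes:     [[0, 3], [1, 4], [2, 5]]

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))   # False
```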

In [None]:
#change to one dimension
A.flatten()

In [None]:
#3x3 Identity Matrix
np.eye(3)

In [None]:
#3x3 Random Matrix
R = np.random.random((3,3))
R

In [None]:
R.max()

In [None]:
#get range of ints from start to end -1 
np.arange(1, 10, step = 1)

In [None]:
#make arange matrix
np.arange(1,16).reshape((3,5))

In [None]:
#create "num" equally spaced items from start to stop (stop included)
np.linspace(start = 5, stop = 10, num = 6)

In [None]:
np.linspace(start = 5, stop = 10, num = 11)

In [None]:
#padding
np.pad(O, pad_width = 1, mode='constant', constant_values=0)

In [None]:
#dot products
np.array([1,2,3]) @ np.array([4,5,6])

In [None]:
np.array([1,2,3]).dot(np.array([4,5,6])) 

In [None]:
#sum, mean, min, max, std, var all work the same way, for the median use np.median(O)
O.sum()  #swap in any of the others

In [None]:
#argmax might come in handy too for multiclass classification
R.argmax(axis=1) # returns the index with the highest value per row (axis=1 => evaluate over columns)
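
For multiclass classification this typically looks like the sketch below, assuming a made-up score matrix with one row per sample and one column per class (scores and predicted are hypothetical names):

```python
import numpy as np

# Hypothetical class scores: 3 samples (rows) x 4 classes (columns)
scores = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.1, 0.5],
])

predicted = scores.argmax(axis=1)   # index of the highest score in each row
print(predicted)                    # [1 0 3]
```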

In [None]:
#squeeze, you may use this a lot: it gets rid of useless extra dimensions (those of size 1)
M = np.random.randint(0, 100, (1, 5, 2, 3))

In [None]:
M.shape

In [None]:
S = np.squeeze(M)
S.shape

In [None]:
#similarly
E = np.expand_dims(S, axis=0) #axis tells numpy where you want to insert the extra dimension
E.shape

In [None]:
np.array_equal(E, M)

That's it.
There are lots more useful functions in numpy and pandas, but I don't think you'll need to know much more to get started.

### Sklearn helpful guide
This is a great place to get started: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Be sure to read the docs carefully. For simple tasks like regression, it may be worth trying the theoretical approach in numpy if you run into weird bugs.