# Welcome to Drishti's workshop on Machine Learning / Deep Learning / Computer Vision

<hr style="border:2px solid gray"> </hr>

If you are not familiar with the Python programming language, don't worry. Python is one of the easiest programming languages you can learn today, and it won't take you long to start writing it yourself! (You will catch up even quicker if you already know a different language like C, JavaScript or Java.)

These are excellent python tutorials made by Corey Schafer : 

https://www.youtube.com/watch?v=k9TUPpGqYTo&list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU&index=2

Watch videos: 2,3,4,5,6,7,8,9,15 to 21.

**Make sure you watch these before proceeding further**.

Feel free to explore python on your own by reading articles about it or watching videos made by other creators. Also, don't worry if you don't understand something at any point. We will always be available to help you. :-)

Now that you have watched these videos, let's start with our first library:

PS: feel free to change the values/shapes of any matrix and try to understand what is going on by experimenting! Have fun :)

<hr style="border:2px solid gray"> </hr>

In [None]:
#this cell is to ignore all warnings that python throws out
#Simply run this.
#No need to understand or worry about this right now.
import warnings
warnings.filterwarnings("ignore")

<hr style="border:2px solid gray"> </hr>

# NumPy

In machine learning/deep learning/computer vision you will deal a lot with matrices. Large matrices. We are talking about matrices containing millions of elements and multiple axes. And you will have to perform various operations on these, like matrix multiplication, addition, etc. But don't worry, Python has you covered here.

In [None]:
import numpy # pronounced "num pie"

When you ran the above code, you told python to import a library called `numpy`. `NumPy` stands for Numerical Python. It is a library made by some brilliant people for doing calculations on large matrices efficiently.

Now, here is something that I want you to remember about python. Python is all about doing things that actually matter. In Python we don't believe in reinventing the wheel every time we have to do something. We simply focus on building the best car possible from parts that someone else has already manufactured and optimized. 

So many-a-times in python you will see people "import" various libraries to use the tools created by other people instead of re-writing the code themselves.

`NumPy`, for example, has its core written in C (a much faster language than python), is heavily optimized, can hand some operations (like large matrix multiplications) to libraries that use multiple cores of your CPU (depending on how it was installed), and has an enormous community of people who constantly check for bugs and maintain it.

Let me give you a quick demo of the power and simplicity of `NumPy`. Say you want a matrix with 12 rows and 12 columns that contains all zeros. Well, using `numpy` it takes merely one line!

In [None]:
numpy.zeros((12,12))

As you can see from running the above code you have a matrix of size (12,12) containing all zeros.

That was fun wasn't it. Alright lets try something new. We are programmers, we are lazy people. We aren't going to count till 12 to check if the above output is actually (12,12). We want `numpy` to do that too!

So lets run the above code once more but this time we will store the output into a variable.

In [None]:
my_matrix = numpy.zeros((12,12))

Notice how we don't get any output this time. We can see the matrix by printing it:

In [None]:
print(my_matrix)

Alright, now lets find the shape of `my_matrix`

In [None]:
numpy.shape(my_matrix)

Yes! It is in fact (12,12).

Now, writing `numpy` again and again is tiresome (remember we programmers are lazy people!) so lets rename `numpy` to something else, something short, like `np`. In python you can do this by making a small change in the import statement:

In [None]:
import numpy as np

Voilà! Now you can simply write `np` in the code instead of `numpy` and everything will work just fine! Let's try it:

In [None]:
np.zeros((12,12))

Now look, you can name any library anything you want, but there are certain customs we must follow. Kinda like how you can legally name your son anything you want but still you won't name him dummy due to social pressure or something, although it would make a good name, but I digress.

Similarly, in the python community we have certain customs, like: `numpy` should always be imported as `np`. This makes it easier for someone who is looking at your code to understand what is going on.

Alright now that that is out of the way lets have some more fun with numpy and see what it has to offer.

## Converting Python lists to NumPy's Ndarrays

In [None]:
my_list = [[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]] # my list is a python list

In [None]:
my_list

Here we have created a list in python. Now whenever we wish to use `numpy` on python lists we have to convert them into something called an `Ndarray`. We can do this by doing:

In [None]:
ndarray_of_my_list = np.array(my_list)

In [None]:
ndarray_of_my_list

Hm... It looks the same, doesn't it? So what did we actually do? Well, let's use the built-in `type` function to check the types of `my_list` and `ndarray_of_my_list`

In [None]:
type(my_list)

In [None]:
type(ndarray_of_my_list)

As you can see `ndarray_of_my_list` is in fact different from `my_list` . `my_list` is a simple python `list` whereas `ndarray_of_my_list` is an `ndarray`. Now that we have an ndarray in our hands we can do cool operations on it with the power of `NumPy`!
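
To see why the conversion matters, here is a small sketch (the names `py_list` and `nd_array` are made up for this example). The same `* 2` does two very different things depending on the type:

```python
import numpy as np

py_list = [1, 2, 3]
nd_array = np.array(py_list)

# Multiplying a python list by 2 repeats the list...
doubled_list = py_list * 2
# ...while multiplying an ndarray by 2 doubles every element.
doubled_array = nd_array * 2

print(doubled_list)   # [1, 2, 3, 1, 2, 3]
print(doubled_array)  # [2 4 6]
```

This element-by-element behaviour is what makes ndarrays so handy for maths.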

## Ndim and Arange

Lets find the dimension of that array!

In [None]:
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

In [None]:
np.ndim(a) # np.ndim() gives the number of dimensions (axes) of the array.

`a` is a 2-D array.

Lets see some more dimensions. I am going to use `np.arange` function for this. `np.arange` takes the starting value and the ending value and forms an array for us using the values in between.

In [None]:
b = np.arange(1,10)

In [None]:
b # we get an array from 1 to 9. np.arange will give values including the first element (here 1) 
  # but excluding the last element (here 10)

In [None]:
np.ndim(b) # as you can see b is a 1-D array

What? You say you want an array with values starting from 1 and going till 100 in steps of 3? Well, `arange` has got you covered!

In [None]:
np.arange(1,100,3) # the third argument tells arange to take steps of 3 instead of the default 1.

Lets see some more cool examples:

In [None]:
np.arange(100,1,-1) # go reverse man! take negative steps

In [None]:
np.arange(1,3,0.01) # baby steps

In [None]:
np.arange(2,1,-0.01) # negative baby steps!

I think you get the point. This is why I love python. You get so much done in just one line !
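
One small caution about baby steps: with non-integer steps like `0.01`, floating point rounding can make the exact number of elements returned by `arange` hard to predict, and the NumPy documentation suggests `np.linspace` for such cases. A minimal sketch (the name `points` is just for illustration):

```python
import numpy as np

# Ask for exactly 5 evenly spaced values from 1 to 3.
# Unlike arange, linspace includes the end value by default.
points = np.linspace(1, 3, num=5)

print(points)       # values: 1.0, 1.5, 2.0, 2.5, 3.0
print(len(points))  # 5
```

With `linspace` you say how many values you want instead of how big each step is.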

## Reshape

Awesome! Now lets try to make a 3-D array! 

First let's get a long 1-D array and then let's reshape it to our desired shape

In [None]:
one_d_array = np.arange(0,40)

In [None]:
one_d_array # 0 to 39. (remember 40 is not included!)

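Now let's reshape this. We want the new shape to be `(10,2,2)`. That works because 10 x 2 x 2 = 40, which is exactly the number of elements we have.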

In [None]:
three_d_array = np.reshape(one_d_array,(10,2,2))

In [None]:
three_d_array

In [None]:
np.shape(three_d_array) # and as you can see reshape reshaped our array into the desired shape!

In [None]:
np.ndim(three_d_array) # also because our shape has a third value, the output is a 3-D array!
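
Since we are lazy, here is a sketch of two more things `reshape` can do (the variable name `inferred` is made up here). Passing `-1` for one axis asks NumPy to work out that axis on its own, and asking for a shape that doesn't fit the number of elements raises an error:

```python
import numpy as np

one_d_array = np.arange(0, 40)

# -1 means "figure this axis out for me": 40 / (10 * 2) = 2
inferred = np.reshape(one_d_array, (10, -1, 2))
print(inferred.shape)  # (10, 2, 2)

# 7 * 3 = 21 is not 40, so this reshape is impossible
try:
    np.reshape(one_d_array, (7, 3))
except ValueError as error:
    print("reshape failed:", error)
```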

## Random

Well arranging values in an order is all well and good but it is kinda boring to be honest..

What if we want a matrix with random values. Lets get a little random shall we! 

In [None]:
np.random.rand(4,5) # np.random.rand will give you a matrix of random values with the shape you asked for(here (4,5))

Notice that `np.random.rand` gives random numbers between 0 and 1 (0 can show up, 1 never does).

You can do the following to get values between 0 and 100:

In [None]:
np.random.rand(4,5)*100

Another cool `random` feature (pun unintended):

In [None]:
np.random.randint(5,size=(2,3))

This gives a 2x3 array of random integers from 0 to 4 (the upper limit 5 is not included).
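
Random numbers change on every run, which is annoying when you want to compare your output with a friend's. A minimal sketch of how to get repeatable "random" numbers using NumPy's `default_rng` with a seed (the seed `42` and the names `rng_a`, `rng_b` are arbitrary choices for this example):

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = rng_a.integers(low=0, high=5, size=(2, 3))   # integers from [0, 5)
second = rng_b.integers(low=0, high=5, size=(2, 3))

print(first)
print(np.array_equal(first, second))  # True
```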

## Sorting a matrix

In [None]:
matrix = np.random.rand(4,5)

In [None]:
matrix

In [None]:
matrix.sort() # .sort() is a method that every ndarray has (python lists have one too). Notice how it doesn't return anything...

In [None]:
matrix

`sort` will sort a matrix. Notice that in this case the variable `matrix` changes to a sorted form.

`sort` sorts along the last axis (axis -1) by default. For our 2-D matrix this means the values inside each row get sorted.

You can explicitly tell it to sort along axis 0 instead (here, the values inside each column get sorted) 

In [None]:
matrix.sort(axis=0)

In [None]:
matrix
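
Random values make it hard to see what each axis does, so here is a sketch with a small hand-made array (`grid` is a made-up name). It uses `np.sort`, which returns a sorted copy and leaves `grid` untouched, so both results can be compared side by side:

```python
import numpy as np

grid = np.array([[3, 1, 2],
                 [9, 7, 8],
                 [6, 4, 5]])

by_row = np.sort(grid, axis=-1)  # sort the values inside each row
by_col = np.sort(grid, axis=0)   # sort the values inside each column

print(by_row)
# [[1 2 3]
#  [7 8 9]
#  [4 5 6]]
print(by_col)
# [[3 1 2]
#  [6 4 5]
#  [9 7 8]]
```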

## Copying a matrix

In [None]:
matrix = np.random.rand(3,3)*100

In [None]:
matrix

As you noticed earlier, when we did `.sort` on a matrix, the values in that matrix got affected. Sometimes we don't want this to happen, we wish to create a copy of the original matrix and work on that instead of affecting the original.

Now there is a catch here, simply doing something as shown below won't work (you might think you are simply assigning the values in matrix to new_matrix but wait and see):

In [None]:
new_matrix = matrix

Why won't this work? Well lets first check what elements new_matrix has:

In [None]:
new_matrix

Now lets try modifying `new_matrix`

In [None]:
new_matrix.sort(axis=-1)

In [None]:
new_matrix

And, `new_matrix` is modified, well no surprise there...

But lets check `matrix` once shall we..

In [None]:
matrix

Wait what!? `matrix` got sorted too!?? 

Yes. You see, when you do `new_matrix = matrix` you are more or less telling python to refer to `matrix` as `new_matrix`. That's all. No data is copied. It is like renaming your son from dummy to tummy. He is still gonna be your son!

For people who know a different programming language: this is like having two references (or pointers) to the same object.

Alright so what should we do now? How should we copy the contents of `matrix` into `new_matrix`, so that the two are completely separate?

`np.copy` to the rescue!

In [None]:
matrix = np.random.rand(3,3)*100 # lets get a fresh new matrix to work with
matrix

In [None]:
new_matrix = np.copy(matrix)

In [None]:
new_matrix

In [None]:
new_matrix.sort() # sort new_matrix

In [None]:
new_matrix # new_matrix sorted along the row

In [None]:
matrix

Look! `matrix` isn't affected. It is like you cloned your son dummy into a completely new being tummy. (You did that to your own son! omg) And now you can safely experiment on your new son tummy. (that analogy went too dark too quickly)

Alright alright I sort of lied to you. You can sort a matrix using the `np.sort` function instead of the array's own `.sort()` method. That way you won't actually affect the array you are sorting and will get a new array automatically. (I just wanted to teach you the difference between two names for the same array and a real copy.)

In [None]:
matrix = np.random.rand(3,3)*100 # lets get a fresh new matrix to work with
matrix 

In [None]:
new_matrix = np.sort(matrix)

In [None]:
new_matrix

In [None]:
matrix

Notice how using `np.sort` instead of `sort` didn't affect the original `matrix` and simply returns a sorted `new_matrix`
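
If you are ever unsure whether two variables point to the same array, you can simply ask python. A small sketch (the names `alias` and `clone` are made up for this example) using the `is` operator and `np.shares_memory`:

```python
import numpy as np

matrix = np.array([[3, 1, 2], [6, 5, 4]])
alias = matrix           # just another name for the same array
clone = np.copy(matrix)  # a separate array with the same values

print(alias is matrix)                   # True
print(clone is matrix)                   # False
print(np.shares_memory(matrix, clone))   # False

alias[0, 0] = 99       # change the array through the other name
print(matrix[0, 0])    # 99, because alias and matrix are the same array
print(clone[0, 0])     # 3, the copy is unaffected
```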

## Adding and removing elements

In [None]:
my_arr = np.arange(1,10)
my_arr

You can add elements to this array by using the `np.append` function

In [None]:
my_arr = np.append(my_arr,90) # append 90 to my_arr and store the output in my_arr
my_arr = np.append(my_arr,100)# then append 100 to that
my_arr

Lets say you have 2 arrays. `array1` and `array2`. And for some reason you want to make a bigger array by joining these two arrays together..

In [None]:
array1 = np.random.randint(low = 0,high = 10,size = (10,))# make an array using random integers from [low,high) of size (10,)
array1

In [None]:
array2 = np.random.randint(low = 0,high = 10,size = (10,))
array2

Using `np.append` you can concatenate the two arrays into one large array

In [None]:
np.append(array1,array2)

Lets try this with 2-D arrays

In [None]:
array1 = np.random.randint(low = 0,high = 10,size = (10,10))# make an array using random integers from [low,high) of size (10,10)
array2 = np.random.randint(low = 0,high = 10,size = (10,12))# make an array using random integers from [low,high) of size (10,12)
print('array1\n',array1)
print('array2\n',array2)

In [None]:
np.append(array1,array2) # np.append has axis = None by default. So we get this weird output which is 1-D.

In [None]:
np.append(array1,array2,axis=-1) #instead if we give it axis=-1 it concatenates the 2 matrices along the "column axis"

You can remove elements using `np.delete`. `np.delete` takes the array and the index (or indices) to delete. When you don't give it an `axis`, it works on the flattened array and returns a 1-D result.

In [None]:
array1

In [None]:
np.delete(array1, (0,0)) # no axis given: array1 is flattened and the element at flat index 0 is removed (the result is 1-D)
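
Most of the time you will want to delete a whole row or a whole column rather than one flat element. A small sketch of how the `axis` argument changes what `np.delete` does (`grid` is a made-up array for this example):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0 to 11

flat_result = np.delete(grid, 0)          # no axis: flatten, then drop index 0
without_row = np.delete(grid, 0, axis=0)  # drop row 0
without_col = np.delete(grid, 0, axis=1)  # drop column 0

print(flat_result.shape)  # (11,)
print(without_row.shape)  # (2, 4)
print(without_col.shape)  # (3, 3)
```

Like `np.append`, `np.delete` returns a new array and leaves the original untouched.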

## Slicing and indexing ndarrays

Before reading this make sure you have watched [Corey Schafer's video on python list slicing and indexing](https://www.youtube.com/watch?v=ajrtAuDg3yw&list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU&index=20&t=0s). Else you won't understand what is going on.

Also you can read [this StackOverflow answer](https://stackoverflow.com/a/509295/11573842) 

(StackOverflow is a great site meant for developers/coders to ask questions and share problems with other developers/coders and get possible solutions)

You can slice and index ndarrays just like you slice python lists.

In [None]:
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

In [None]:
a[0] # gets the row 0

In [None]:
a[0][0] # first get the row 0 and then get 0th element in that, here 1

In [None]:
a[:,2] # all rows, 3rd column 

Notice how the result is an ndarray in itself.
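
Unlike python lists, ndarrays also let you put the row and the column inside one pair of square brackets, separated by a comma. A few more examples as a sketch, using the same `a` as above:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(a[0, 0])     # 1, same element as a[0][0]
print(a[1:, :2])   # rows 1 onwards, first 2 columns
# [[ 5  6]
#  [ 9 10]]
print(a[-1, ::2])  # last row, every second column
# [ 9 11]
```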

Alright, enough `NumPy` for now. `NumPy` is a great and extremely useful library which is capable of doing a lot more than what I have shown you. So in your free time try to learn more about it.

## Basic operations on matrices and arrays

Note all of the following examples have been taken from the [official documentation](https://numpy.org/doc/1.18/user/quickstart.html#basic-operations) with some added explanation wherever necessary.

In [None]:
a = np.array([20,30,40,50])
b = np.arange(4)

In [None]:
a,b

In [None]:
c = a-b
c #fairly obvious

In [None]:
b**2 # squares each element

In [None]:
b<3 # compares each element with 3 and gives an array of True/False values

In [None]:
A = np.array( [[1,1],
             [0,1]] )

In [None]:
B = np.array( [[2,0],
             [3,4]] )

In [None]:
 A * B  # elementwise product

In [None]:
np.multiply(A,B) # elementwise product

In [None]:
 A @ B   # matrix multiplication

In [None]:
A.dot(B) # another way to write the matrix product
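
The difference between `*` and `@` trips up a lot of people, so here is a small worked sketch using the same `A` and `B` (the name `top_left` is made up). Each entry of the matrix product is a row of `A` multiplied element by element with a column of `B` and then summed:

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])

elementwise = A * B     # multiplies matching positions
matrix_product = A @ B  # rows of A combined with columns of B

# Entry (0, 0) of the matrix product, computed by hand:
# row 0 of A is [1, 1], column 0 of B is [2, 3], so 1*2 + 1*3 = 5
top_left = A[0, 0] * B[0, 0] + A[0, 1] * B[1, 0]

print(elementwise)     # [[2 0]
                       #  [0 4]]
print(matrix_product)  # [[5 4]
                       #  [3 4]]
print(top_left)        # 5
```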

In [None]:
A.T # transpose

In [None]:
np.transpose(A)  # transpose

In [None]:
a = np.ones((2,3), dtype=int)
print("before:\n",a)
a *= 3 # equivalent to a = a*3. This is a short notation. Because, you guessed it, python programmers are lazy people
print("after:\n",a)

In [None]:
a = np.random.random((2,3))
print("matrix:\n",a,"\n")
sum_ = a.sum() #sum is a built-in python function. Naming a variable sum would hide it, so I added a _ after it
print("Sum",sum_) 
min_ = a.min() #min is a built-in python function too, so again I added a _ after it
print("Min value",min_)
max_ = a.max() #same story for max
print("Max value",max_)

In [None]:
np.argmax(a) # Returns the index of the maximum value. With no axis given, it is the index in the flattened array.

In [None]:
np.argmin(a) # Returns the index of the minimum value. With no axis given, it is the index in the flattened array.
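
A single flat index isn't very friendly for a 2-D array. Here is a sketch of how to turn it back into a (row, column) pair with `np.unravel_index`, and how to get one answer per column with the `axis` argument (`scores` is a made-up array for this example):

```python
import numpy as np

scores = np.array([[0.2, 0.9, 0.1],
                   [0.4, 0.3, 0.8]])

flat_index = np.argmax(scores)  # position in the flattened array
row, col = np.unravel_index(flat_index, scores.shape)
per_column = np.argmax(scores, axis=0)  # row index of the max in each column

print(flat_index)  # 1
print(row, col)    # 0 1
print(per_column)  # [1 0 1]
```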

In [None]:
a = np.arange(-5,5,0.2)
a # numpy gives scientific output like this sometimes...

In [None]:
np.ceil(a) # Returns the ceiling of the input, element-wise.

In [None]:
np.floor(a) #Returns the floor of the input, element-wise.

<hr style="border:2px solid gray"> </hr>

# Pandas

**A quick reminder**: *make sure you have watched the [5th video]( https://www.youtube.com/watch?v=daefaLgNkw0&list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU&index=6&t=0s) and [19th video](https://www.youtube.com/watch?v=ajrtAuDg3yw&list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU&index=19) of the playlist shared above.*

The 5th video explains python dictionaries. `Pandas dataframes` are similar to dictionaries. The 19th video explains list slicing, which will be useful.

Lets start with `Pandas` now. Here is what the official Pandas site has to say about it:
> Pandas is a fast, powerful, flexible and easy to use open source **data analysis and manipulation** tool,
built on top of the Python programming language. 

Now we are getting into one of the core aspects of data science. Data analysis and data manipulation. 

We are going to use this tool to get valuable insight into a developer survey's result.

First let's get our data. Go to this [url](https://insights.stackoverflow.com/survey). Then click on `Download Full Data Set (CSV)` for the **year 2019**. Download the entire folder into the `assets` folder by clicking this button in the top left corner of the screen: ![download button](assets\Images\download.png) This should have downloaded a zip file.

**If you have the appropriate tool to unzip zipped folders then unzip all its contents into a folder named `developer_survey_2019` inside the `assets` folder. Else run the cell below, it will automatically unzip the zipped folder and save the contents in `developer_survey_2019`.**


**If you have any difficulty in this step contact us immediately, as without this you can't proceed any further.**

In [None]:
#Code to unzip the zipped folder.

#Make sure that you saved `developer_survey_2019.zip` folder inside the `assets` folder, only then this code will work.

#There is no need to worry about this code.
#If you are interested then you can learn about the libraries involved by googling them and reading about them.
#But it is ok if you don't understand this part. You will eventually learn python 
#and its different modules once you start using the language more often...
import os
import errno
import zipfile

path_to_zip_file = 'assets/developer_survey_2019.zip'

if os.path.exists(path_to_zip_file):
    with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
        zip_ref.extractall('assets/developer_survey_2019/')
    assert os.path.exists('assets/developer_survey_2019/')
    print('Unzipped successfully into assets/developer_survey_2019')
else:
    print("Make sure that the zip file has been downloaded to the assets folder otherwise this code won't work")
    print("Contact us if you need any help.")
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), 'developer_survey_2019.zip')

The `developer_survey_2019` folder contains the results of a survey conducted by [StackOverflow](https://en.wikipedia.org/wiki/Stack_Overflow) in the year 2019. 

(If you are interested to know more about this survey read the pdf : `so_survey_2019.pdf` inside `assets/developer_survey_2019`)

Open the folder `developer_survey_2019` and find a file named `survey_results_public.csv`. This is a `csv` file. `csv` stands for `comma separated values`.

>A CSV is a file which allows data to be saved in a tabular format. CSV files can be used with most spreadsheet programs, such as Microsoft Excel or Google Spreadsheets. They differ from other spreadsheet file types because you can only have a single sheet in a file, they can not save cell, column, or row. Also, you cannot save formulas in this format.

**Why are .CSV files used?**

>These files serve a number of different business purposes. They help companies export a high volume of data to a more concentrated database.

In short: `csv` files are basically plain text tables where each line is a row and commas separate the cells.

Alright, now try opening the file in excel. You should see something like this:

![csv file](assets\Images\csv.png)

Rows upon rows of tasty data!!

Let's try getting the data from this file using the pandas library.

## Getting the data

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("assets/developer_survey_2019/survey_results_public.csv")

In the above line we use `pd.read_csv` to, well, read the csv file pointed to by the path: `assets/developer_survey_2019/survey_results_public.csv`. Then we store the result in `df`. `read_csv` returns a `pandas dataframe object`. 

Lets see what we have inside `df`...

In [None]:
df

Immediately you can start seeing valuable information about your data. 

Look at the first row...

![first row](assets\Images\first_row.png)

This row contains the `features` (the column names) of your dataframe, while the rest of the rows contain the data corresponding to these features.

Now lets use `pandas` to get more information about this dataset:

## Exploring the data

In [None]:
df.info()

Now here is some information we can get from this:
- So df has `88883 rows` and `85 features`. (You can also use `df.shape` to get this exact information)
- We can see the names of all 85 features with info about the datatype of the data within. (Under the `Dtype` column)(You can also use `df.dtypes` to get this exact information)
    - `int64` indicates that the feature has integer values in it.
    - `float64` indicates that the feature has floating values in it (basically numbers with decimals).
    - `object` indicates that the feature has strings in it. (like : "Not employed, and not looking for work","20s",etc)
- also this line `dtypes: float64(5), int64(1), object(79)` clearly indicates that we have only `5 features with dtype float64`,`1 with dtype int64` and the rest `79 are strings`
- also note the `Non-Null Count` column. It shows how many of the 88883 entries in each feature are **not** `NaN`. The remaining entries are `NaN` (`NaN` stands for `Not A Number`. It basically means that this data is missing or is `Null`)
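
If you would rather see the number of missing values directly, here is a minimal sketch on a tiny made-up dataframe (`toy` and its values are invented for this example, they are not from the survey). `isnull()` marks the missing cells with `True`, and `sum()` counts them per column:

```python
import numpy as np
import pandas as pd

# A tiny hypothetical dataframe with a few missing values
toy = pd.DataFrame({
    "Country": ["India", None, "Germany", "India"],
    "Age": [21.0, np.nan, 35.0, np.nan],
})

missing_per_column = toy.isnull().sum()    # how many values are missing
non_null_per_column = toy.notnull().sum()  # what info() calls Non-Null Count

print(missing_per_column)
print(non_null_per_column)
```

The same two lines should work on `df` too.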

Go back to the `assets\developer_survey_2019` folder. This time open the file named `survey_results_schema.csv` in excel. You should see this:

![csv file](assets\Images\csv2.png)


Clearly this file contains the meaning of all the 85 features. You can go through them if you are interested.

Now lets get back to our `df`

Instead of printing out the entire `df` every time we wish to take a peek at it, we will use the `df.head` function. This allows us to see the "head" of the dataframe, that is, the top few rows.

In [None]:
df.head(7) # show us the first 7 rows. If no argument is passed then default = 5

If we are interested in seeing the last few rows we use `df.tail`

In [None]:
df.tail() #default = 5

### Basic column and row handling

=> **Column**

To see a list of available columns do:

In [None]:
df.columns # and just like that you are blessed with 85 feature names!

Now lets try to see only one column of this `dataframe`

In [None]:
df['Country'] # just like dictionaries!

Doing `df['Country']` shows us only the `Country` column. You can see multiple columns as well by passing them as a `list`

In [None]:
df[['Country','Age']]

=> **Row**

Now to access a particular row we will use `df.iloc`. `iloc` takes in the index of the row and the index of the column. If the index of the column isn't specified then all columns are included.

In [None]:
df.iloc[0] # get the first row of the df. 0 here is the index of that row

Like columns you can get multiple rows:

In [None]:
df.iloc[[0,1]] # first 2 rows

You can get a particular column in these 2 rows too:

In [None]:
df.iloc[[0,1],1] # first 2 rows and column 1 (MainBranch)

**Note:** You can also use **`.loc`** in this case. `.iloc` and `.loc` are quite similar. `iloc` is based on `index based slicing` (integer positions) whilst `loc` is based on `label based slicing`. Now beware, `iloc` being `index based` follows python's indexing rules. So when you try something like this: `df.iloc[5:9]` you are selecting rows `[5,9)`, where the last value (`9` in this case) isn't included in the slice. A label slice with `.loc`, on the other hand, includes both ends: `df.loc[5:9]` gives the rows labelled 5 through 9. Same example as above but with `.loc`:

In [None]:
df.loc[[0,1],"MainBranch"] # notice the label : MainBranch
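
In `df` the row labels happen to be 0, 1, 2, ... so positions and labels look the same. Here is a sketch with a tiny made-up dataframe (`toy`, with invented labels 10, 20, 30, 40) where the difference between `iloc` and `loc` is easier to see:

```python
import pandas as pd

# Hypothetical dataframe whose row labels are NOT 0, 1, 2, ...
toy = pd.DataFrame(
    {"Country": ["India", "Germany", "Canada", "Brazil"]},
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[0:2]  # positions 0 and 1 (position 2 is excluded)
by_label = toy.loc[10:30]    # labels 10 through 30 (30 is included!)

print(by_position)
print(by_label)
```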

**value_counts**

Now lets try to explore the data a bit by using the `value_counts` method/function.

In [None]:
df['Country'].value_counts()

And just like that we know that this site is popular amongst people from the United States, India, Germany, the United Kingdom and Canada.
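
Raw counts are nice, but sometimes you want shares instead. A minimal sketch on a small made-up series (`countries` is invented for this example): passing `normalize=True` to `value_counts` gives fractions that add up to 1.

```python
import pandas as pd

# Hypothetical answers from 4 respondents
countries = pd.Series(["India", "India", "Germany", "Canada"])

counts = countries.value_counts()                # how many times each value occurs
shares = countries.value_counts(normalize=True)  # the same thing as fractions

print(counts["India"])  # 2
print(shares["India"])  # 0.5
```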

### Filtering

I want to show you something.. run the cell below:

In [None]:
df['Gender']=='Man'

Notice how this results in a long column of True/False values? Alright run the following cells too:

In [None]:
mask = df['Hobbyist']=="Yes"
mask # notice how the mask is True only when 'Hobbyist'=="Yes" otherwise it is False

In [None]:
df[mask].head()  #you can also do df.loc[mask].head() to get the exact same result. Try it

When we use this mask inside `df` we notice that we only get the rows that had `True` in the mask. Essentially we are keeping only the cases where the person is a hobbyist and filtering out everyone else.
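
Here is the same idea as a sketch on a tiny made-up dataframe (`toy` is invented for this example), so you can check every row by eye. It also shows the `~` operator, which flips a mask so that you get everyone who is not a hobbyist:

```python
import pandas as pd

toy = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "Hobbyist": ["Yes", "No", "Yes", "No"],
})

mask = toy["Hobbyist"] == "Yes"  # True, False, True, False
hobbyists = toy[mask]            # rows where the mask is True
non_hobbyists = toy[~mask]       # ~ flips True and False

print(mask.sum())                # 2, because True counts as 1
print(hobbyists["Respondent"].tolist())      # [1, 3]
print(non_hobbyists["Respondent"].tolist())  # [2, 4]
```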

You can go into a lot more depth with this: Say you are interested in `hobbyists` who are also `male`. We will use `df.query` for this purpose.

In [None]:
df.query('Gender =="Man" and Hobbyist =="Yes"').head()

If you are only interested in seeing the `Respondent` feature of the people who are both hobbyists and male, then do:

In [None]:
df.query('Gender =="Man" and Hobbyist =="Yes"') ['Respondent'] 

Remember how `df` slicing returned mini dataframes..? This is what allowed us to write `['Respondent']` after `df.query('Gender =="Man" and Hobbyist =="Yes"')` because the output of `df.query('Gender =="Man" and Hobbyist =="Yes"')` is a dataframe in itself..

In [None]:
type(df.query('Gender =="Man" and Hobbyist =="Yes"')) # the output is a dataframe

You can also use `&` operator to combine two masks together:

In [None]:
mask = ((df['Student']=="No") & (df['Country']=="India")& (df['LanguageWorkedWith']=="Python"))
df.loc[mask,"Age1stCode"]

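The above two lines find the age at which non-student respondents from India started coding, but only for those whose `LanguageWorkedWith` answer is exactly `Python`. (`==` needs an exact match, so people who listed Python along with other languages are not included here.)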

**A few more examples:**

In [None]:
mask = ((df['Age']>14.0) & (df['Age']<20.0))
gender = df.loc[mask,"Gender"] # get the gender of people between the age group (14,20).
print(gender)
print("\n---Value counts---") # get the value counts
print(gender.value_counts())

In [None]:
ages = df.Age1stCode.value_counts() # you can do df.Age1stCode too instead of df['Age1stCode']
ages

In [None]:
print(ages[0:10]) # we can slice it to get a list of top 10 ages at which people wrote their first code...

<a id='ages_internal_link'></a> We can plot this using the `plot` function. We are using a `bar plot` here

In [None]:
df['Age1stCode'].value_counts()[0:20].plot(kind='bar') # note [0:20] is plotting the top 20 values only

Now you can clearly compare and contrast...

Try plotting other things like `Age` and get some insights...

## String methods, inserting/replacing/removing data
[//]: # ".... . .-.. .-.. --- / - .... . .-. . -.-.-- / -.-- --- ..- / ..-. --- ..- -. -.. / -- . -.-.-- / .. / .- -- / -. --- - / ... ..- .-. . / .. ..-. / -.-- --- ..- / -.. .. -.. / .. - / .- -.-. -.-. .. -.. . -. - .- .-.. .-.. -.-- / --- .-. / .. -. - . -. - .. --- -. .- .-.. .-.. -.-- / -... ..- - / --. --- --- -.. / .--- --- -... -.-.-- / .. / -.- -. --- .-- / -- --- .-. ... . / -.-. --- -.. . / .-- .- ... / .--. .-. --- -... .- -... .-.. -.-- / .-. . .- .-.. .-.. -.-- / . .- ... -.-- / - --- / .. -.. . -. - .. ..-. -.-- / .- -. -.. / -.. . -.-. --- -.. . / -.--. .... --- .--. . ..-. ..- .-.. .-.. -.-- / ..- ... .. -. --. / ... --- -- . / --- -. .-.. .. -. . / - --- --- .-.. -.--.- .-.-.- / ..- -. .-.. . ... ... / -.-- --- ..- / -.-. .- -. / .- .-.. .-. . .- -.. -.-- / - . .-.. .-.. / - .... .. ... / .. ... / .- -. / . .- ... - . .-. / . --. --. .-.-.- / .. / .- -- / -.-- .- - .. -. .-.-.- / -. --- .-- / - .... .- - / -.-- --- ..- / .... .- ...- . / ..-. --- ..- -. -.. / - .... .. ... / -- . ... ... .- --. . --..-- / .-.. . - / -- . / -.- -. --- .-- .-.-.- / .. / .-- .. .-.. .-.. / --. .. ...- . / -.-- --- ..- / .- / -- .. -. .. / - .-. . .- - / .. ..-. / .. / .- -- / ... - .. .-.. .-.. / .- - / ... ...- -. .. -"

### String methods

Look at the `LanguageWorkedWith` column. Languages are separated using `;`s. 

In [None]:
df['LanguageWorkedWith']

To deal with this type of data we will use `string methods`

In [None]:
df['LanguageWorkedWith'].str.contains("Python",na=False)

`df['LanguageWorkedWith'].str.contains` checks if the strings within contain `Python`. The `na=False` part is to deal with `NaN` values.

Later on we will see how to deal with `NaN` once and for all

Again we can use `df['LanguageWorkedWith'].str.contains("Python",na=False)` as a filter..

In [None]:
mask = df['LanguageWorkedWith'].str.contains("Python",na=False)
df.loc[mask,["Respondent","LanguageWorkedWith"]] # we can see the Respondent and LanguageWorkedWith
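
A small sketch of how `==` and `str.contains` differ, on a made-up series (`langs` is invented for this example). It also shows one thing to watch out for: `str.contains` looks for the text anywhere in the string, so searching for `Java` also matches `JavaScript`. Splitting on `;` with `str.split` gives you proper lists instead:

```python
import pandas as pd

langs = pd.Series(["Python", "C;Python;Java", "JavaScript", None])

exact = langs == "Python"                              # only the exact string matches
contains = langs.str.contains("Python", na=False)      # matches anywhere in the string
contains_java = langs.str.contains("Java", na=False)   # careful: also matches JavaScript
split_lists = langs.str.split(";")                     # one python list per row

print(exact.tolist())          # [True, False, False, False]
print(contains.tolist())       # [True, True, False, False]
print(contains_java.tolist())  # [False, True, True, False]
print(split_lists.iloc[1])     # ['C', 'Python', 'Java']
```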

### inserting/replacing/removing data

Say during our data analysis we found that a few columns don't really provide us with essential information. We can remove them safely without losing valuable information. We can get rid of these columns by using `df.drop` method.

Say we find the columns `OrgSize` and `FizzBuzz` to be of very little interest.
>`OrgSize`: Approximately how many people are employed by the company or organization you work for?

>`FizzBuzz`: Have you ever been asked to solve FizzBuzz in an interview?

In [None]:
new_dataframe = df.drop(columns=['OrgSize', 'FizzBuzz']) 
new_dataframe

Another way of removing columns is to use `drop` with `inplace=True`. `inplace = True` tells `drop` to remove the columns directly from `df` instead of just returning a new dataframe. But before doing that lets save 'OrgSize' and  'FizzBuzz' separately first.

In [None]:
to_be_removed = df[['OrgSize', 'FizzBuzz']]
print("to_be_removed:\n",to_be_removed)
df.drop(columns=['OrgSize', 'FizzBuzz'],inplace=True) #notice how this time it doesn't return anything

Now if you check the columns in `df` you will notice that `OrgSize` and `FizzBuzz` aren't amongst them:

In [None]:
print('OrgSize' in df.columns)
print('FizzBuzz' in df.columns)
print('Age' in df.columns) # as a check..

Notice all those `NaN` values in there? These annoying little things are hard to deal with. You will have to make tough decisions about what to do with them. If a row is filled with them then you have no other option than to remove it. Otherwise, if say only a few rows have missing values, then we can do something smart and fill in these `NaN` values with something we believe can fit in (like if all people seem to have an `age` around 20 years (the mean of their ages is 20), then for people whose `age` is `NaN` we can fill in 20, as their real age won't (in most cases) be very far off from it).

We can use the `dropna` function (drop the rows that have missing, i.e. "not available", values) to get rid of them ... Or we can use the `fillna` function to fill them with the desired value (the desired value can be 0, the mean, the median, etc.)

To demonstrate these I will copy the dataframe...

In [None]:
df_copy1 = df.copy()

In [None]:
df_copy1.dropna() #notice how this returns a result. inplace = False by default.

In [None]:
df_copy1.dropna(inplace=True) 
df_copy1 #df_copy1 got affected as inplace=True

In [None]:
df_copy1.isnull() # you can use .isnull to check if a given value is null..

add `any()` after that and it will show you the column name and whether that column has any missing values

In [None]:
df_copy1.isnull().any() 

again add `any()` to see if any columns have True (NaN values) in them...

In [None]:
print(df_copy1.isnull().any().any()) # df_copy1 has no NaN values whatsoever
print(df.isnull().any().any()) # df has NaN values

Alright, now let's try `fillna` with 0, mean and median. Note: fillna takes a lot of time to run on the entire dataset so we will just focus on the tail of the dataset...

In [None]:
df_copy2 = df.tail().copy()# we will only be seeing the changes on the last 5 elements else it will take too long
df_copy3 = df.tail().copy()# we will only be seeing the changes on the last 5 elements else it will take too long
df_copy4 = df.tail().copy()# we will only be seeing the changes on the last 5 elements else it will take too long

In [None]:
df.tail() #let us see how the tail looks before fillna. In particular notice the NaN in the Age column.

In [None]:
df_copy2.fillna(0,inplace = True)
df_copy2.tail() 
# fillna(0) replaces all NaN values with 0. Even missing values in columns like Country are set to 0 (which doesn't make sense...)

In [None]:
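df_copy3.fillna(df_copy3.mean(numeric_only=True),inplace = True)
df_copy3.tail()
# this fills NaN only in numeric columns, using that column's mean (notice Age... the rest are unaffected)
# numeric_only=True is needed on recent pandas versions, where mean() raises an error on text columns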

In [None]:
df_copy4.fillna(df_copy4.median(numeric_only=True),inplace = True)
df_copy4.tail()
# similarly, filling with the median also affects only numeric columns (because how exactly would you find the median of a string!)

As you can clearly see you can't just blindly use `fillna` or `dropna`. You can make more detailed changes by selecting the column you wish to change...

In [None]:
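# assign the result back to the column; fillna(inplace=True) on a selected column may not update df_copy4 on newer pandas versions
df_copy4['Country'] = df_copy4['Country'].fillna("India") #let's fill NaN countries with "India"
df_copy4['EdLevel'] = df_copy4['EdLevel'].fillna("Unspecified") #let's fill NaN EdLevel with "Unspecified"
df_copy4['UndergradMajor'] = df_copy4['UndergradMajor'].fillna("Unspecified") #let's fill NaN UndergradMajor with "Unspecified"
#and so on...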

df_copy4.tail()
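
When several columns each need their own fill value, `fillna` also accepts a dictionary that maps column names to values. A minimal sketch on made-up data (the `toy` dataframe and the chosen fill values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the survey dataset
toy = pd.DataFrame({
    "Country": ["India", None, "Nepal"],
    "Age": [22.0, np.nan, 30.0],
})

# One fill value per column: a label for text, the median for numbers
fill_values = {"Country": "Unspecified", "Age": toy["Age"].median()}
filled = toy.fillna(fill_values)
print(filled)
```

This keeps all the decisions about missing values in one place, which makes them easier to review later.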

Remember we removed the columns `OrgSize` and `FizzBuzz` from `df` using `df.drop(['OrgSize','FizzBuzz'],inplace = True)`. Well, now as an exercise let's try inserting these columns back into `df`...

In [None]:
df.insert(loc = 0,column='OrgSize',value=to_be_removed.OrgSize) #column is the column name. loc = 0 means location = first (add column at first)
df.insert(loc = 0,column='FizzBuzz',value=to_be_removed.FizzBuzz) #value is actual column

In [None]:
df.head(2)

<hr style="border:1px dashed  gray"> </hr>

The `MainBranch` column has been described as following in the `so_survey_2019.pdf` file :

![MainBranch](assets/Images/ques.png)

Notice how these descriptions are too long and can simply be `encoded` as :
- `0` for `I am a developer by profession`
- `1` for `I am not primarily a developer, but I write code sometimes as part of my work`
- `2` for `I used to be a developer by profession, but no longer am`
- `3` for `I am a student who is learning to code`
- `4` for `I code primarily as a hobby`
- `5` for `None of these`
- `-1` for `NaN`

So essentially what we are interested in doing is to replace these long strings with a single digit. Note: for `NaN` we will still have to use `fillna`. Doing `df_replace['MainBranch'].replace(to_replace = None, value = -1 ,inplace = True)` won't actually work.

In [None]:
df_replace = df.copy() #to keep df unaffected we shall copy it again.
# assign each result back to the column; replace(inplace=True) on a selected column may not update df_replace on newer pandas versions
df_replace['MainBranch'] = df_replace['MainBranch'].replace(to_replace = "I am a developer by profession", value = 0)
df_replace['MainBranch'] = df_replace['MainBranch'].replace(to_replace = "I am not primarily a developer, but I write code sometimes as part of my work", value = 1)
df_replace['MainBranch'] = df_replace['MainBranch'].replace(to_replace = "I used to be a developer by profession, but no longer am", value = 2)
df_replace['MainBranch'] = df_replace['MainBranch'].replace(to_replace = "I am a student who is learning to code", value = 3)
df_replace['MainBranch'] = df_replace['MainBranch'].replace(to_replace = "I code primarily as a hobby", value = 4)
df_replace['MainBranch'] = df_replace['MainBranch'].replace(to_replace = "None of these", value = 5)
df_replace['MainBranch'] = df_replace['MainBranch'].fillna(-1) # NaN still has to be handled with fillna
df_replace['MainBranch']

In [None]:
df_replace['MainBranch']= df_replace['MainBranch'].astype(int) # make the column int instead of float64

In [None]:
df_replace['MainBranch']
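
The same encoding can also be written with a dictionary and `map`. This is only a sketch on a short made-up `Series` (the names `codes` and `toy` are assumptions, and only three of the six answers are listed to keep it short):

```python
import pandas as pd

# Lookup table from answer text to code (shortened for the example)
codes = {
    "I am a developer by profession": 0,
    "I am a student who is learning to code": 3,
    "I code primarily as a hobby": 4,
}

# Hypothetical answers, including one missing value
toy = pd.Series([
    "I code primarily as a hobby",
    None,
    "I am a developer by profession",
])

# map() looks every value up in the dictionary; anything not found becomes NaN
encoded = toy.map(codes).fillna(-1).astype(int)
print(encoded.tolist())
```

One thing to keep in mind with this approach: any answer missing from the dictionary also turns into `NaN` and therefore into `-1`, so the dictionary should list every expected answer.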

Alright. Now let's move on to `Matplotlib`. Again, just like `NumPy`, `Pandas` is an awesome library with a ton of interesting and useful features. And just like with `NumPy`, I can't do justice to it in just one notebook. So you will have to learn more about it on your own.

<hr style="border:1px solid gray"> </hr>

# Matplotlib

The official documentation defines `matplotlib` as :
>Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

We will use `matplotlib` to create different types of graphs and use them to visualize our data. (it can display images too!)

In [None]:
import matplotlib.pyplot as plt #this is how matplotlib is usually imported.
import numpy as np # we will need numpy too..
%matplotlib inline 

`%matplotlib inline` is usually added when using `matplotlib` in `jupyter notebook`. It is a magic function that renders the figure in a notebook. No need to worry about it. It isn't really necessary for you to understand... (You can read this [StackOverflow question](https://stackoverflow.com/questions/43027980/purpose-of-matplotlib-inline) about it if you are interested)

In [None]:
plt.plot([1, 2, 3, 4], [1, 4, 2, 3])

As you can probably tell from the above example, `[1, 2, 3, 4]` contains the `x coordinates` of the points and `[1, 4, 2, 3]` contains the `y coordinates`.

So we are plotting :
- (1,1)
- (2,4)
- (3,2)
- (4,3)

Now let's try plotting a `sine` wave. We will first create an array that will represent theta. And then we will create another array that will have the corresponding value of `sine` at each angle.

In [None]:
theta = np.arange(-2 * np.pi, 2 * np.pi,1)
theta

In [None]:
sine = np.sin(theta)
sine

In [None]:
plt.plot(theta,sine)

What the!?? This doesn't look like a sine wave does it?? What happened?

Well, we are using only 13 points (check `theta.shape`) between -2$\pi$ and 2$\pi$, so we are getting such an ugly graph... Let's increase the number of points by making the step size of `theta` smaller..

In [None]:
theta = np.arange(-2 * np.pi, 2 * np.pi,0.01) 
print(theta,theta.shape) # we are using 1257 points this time..
sine = np.sin(theta)

In [None]:
plt.plot(theta,sine)

Aaah yes! Beautiful.
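
If you would rather choose the number of points directly instead of a step size, `np.linspace` is one option. A small sketch comparing the two (the names `theta_coarse` and `theta_fine` and the choice of 1000 points are assumptions for illustration):

```python
import numpy as np

# arange: you pick the step, numpy works out how many points that gives
theta_coarse = np.arange(-2 * np.pi, 2 * np.pi, 1)

# linspace: you pick the number of points, numpy works out the step
theta_fine = np.linspace(-2 * np.pi, 2 * np.pi, 1000)

print(theta_coarse.shape, theta_fine.shape)
```

Unlike `np.arange`, `np.linspace` includes the end value by default, so `theta_fine` runs all the way to 2$\pi$. Passing `theta_fine` and `np.sin(theta_fine)` to `plt.plot` should give a smooth curve as well.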

Well my favorite color is orange. So...

In [None]:
plt.plot(theta,sine,color='orange') # and just like that we have an orange graph

Well, why stop here... let's get some more trigonometric graphs...

In [None]:
cos = np.cos(theta)
plt.plot(theta,cos,color='red') 

In [None]:
tan = np.tan(theta)
plt.plot(theta,tan)
plt.ylim(-5, 5) # this is to limit the y axis of the graph from -5 to 5...
plt.show()

In [None]:
np.arange(0.4, 1.1, 0.1)

All this is well and good, but I want to see all the graphs overlaid on top of one another...

In [None]:
plt.figure(figsize = (10,5)) # specify the size of the figure (width,height).. always add this before adding .plot..

plt.plot(theta, sine, label='sine')  
plt.plot(theta, cos, label='cos')  
plt.plot(theta, tan, label='tan')
plt.ylim(-5, 5) # try commenting this line and running the cell again..

plt.xlabel('theta') # add label for x axis
plt.ylabel('Magnitude') # add label for y axis
plt.title("Trig Functions") # add title
plt.legend() # add legend

plt.show()

## Bar graph

In [None]:
x = [1,3,5,7,9]
y = [5,2,7,8,2]

x1 = [5,4,4,8,10]
y1 = [8,6,2,5,6]

plt.bar(x,y, label="bar one")
plt.bar(x1,y1, label="bar two", color='g')
plt.xlabel('bar number')
plt.ylabel('bar height')
plt.title('Sample bar graph')
plt.legend()

plt.show()

Remember how we found [the ages at which people wrote their first code](#ages_internal_link) and tried plotting it... Well, I didn't go deeper into it because you need `matplotlib` alongside `pandas` to improve and modify the graph... Let me demonstrate...

In [None]:
df['Age1stCode'].value_counts().plot(kind='bar') #this time lets try plotting all the ages instead of just 20

Oh lord! Clearly we need a larger figure...

In [None]:
plt.figure(figsize=(20,5))
df['Age1stCode'].value_counts().plot(kind='bar') 

Aaah! Much better. Notice how `matplotlib` and the `plot` method of `pandas` work together here...

Let's try `Age` too.

In [None]:
plt.figure(figsize=(20,5))
df['Age'].value_counts().plot(kind='bar') 

Aaaa... A bigger graph won't do it this time.. let's use a `horizontal bar graph`

In [None]:
plt.figure(figsize=(20,25))
df['Age'].value_counts().plot(kind='barh') 

The graph looks ok. But let me ask you something. How does the age group `10 to 20 years` compare to other age groups in terms of how many of them use the site? 

Clearly it is hard to tell. We will need a different type of graph to get this information. We shall use a `histogram`.

In [None]:
bins = np.arange(0,100,10)
df['Age'].plot(kind='hist',bins = bins,grid=True) 
plt.locator_params(axis='x', nbins=len(bins)) # this line is to increase the number of x axis ticks... Try commenting it out
plt.show()

In the above code I first specified how to create the bins: `bins = np.arange(0,100,10)`

bins : `array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])`

So the first bin goes from 0 to 10.

The next bin goes from 10 to 20.

The next from 20 to 30.

And so on...

In the line `df['Age'].plot(kind='hist',bins = bins,grid=True)` we are specifying the graph to be `hist`, the bins to be `[ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90]` and to show the grid.

The `plt.locator_params(axis='x', nbins=len(bins))` line asks `matplotlib` for more x axis ticks, so that there is a tick at (roughly) every bin edge: 0, 10, 20 ... 90. Try commenting it out and seeing the results..
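
To see what the histogram is counting, here is a small sketch that does the binning by hand with `np.histogram` on a few made-up ages. The `ages` list is hypothetical, and using `np.histogram` here is my own way of illustrating the counts, not something the plot above calls explicitly.

```python
import numpy as np

# Hypothetical ages, NOT taken from the survey
ages = [5, 12, 15, 19, 22, 25, 31, 47, 90, 95]
bins = np.arange(0, 100, 10)  # bin edges 0, 10, 20 ... 90

# counts[i] is the number of ages between edges[i] and edges[i + 1]
counts, edges = np.histogram(ages, bins=bins)
print(counts)  # [1 3 2 1 1 0 0 0 1]
```

Each bin includes its left edge but not its right edge, except the last bin (80 to 90), which includes both. That is why 90 is counted and 95 is not: anything beyond the last edge is left out, so with these bins ages above 90 would not show up in the histogram.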

## Subplots

In `matplotlib` you can draw several subplots in one main plot.

In [None]:
x = np.arange(0,10,0.01)
sq = x**2 # this is how you raise a number to some power in python
cube = x**3
log = np.log(x) # note: x starts at 0 and log(0) is -inf, so that first point is not drawn

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,2,1)
plt.plot(x,x)
plt.title('Linear')

plt.subplot(2,2,2)
plt.plot(x,sq)
plt.title('Quadratic')

plt.subplot(2,2,3)
plt.plot(x,cube)
plt.title('Cubic')

plt.subplot(2,2,4)
plt.plot(x,log)
plt.title('Logarithmic')

plt.show()

Let's try to understand what `plt.subplot(2,2,1)` means. The first `2` means 2 rows. The next `2` means 2 columns.

So we want to display `4 subplots` (2$*$2 = 4) in a grid of `2 rows` and `2 columns`.

The third number in `plt.subplot(2,2,1)`, here `1`, tells `matplotlib` which subplot we are currently drawing. Subplots are numbered from left to right, top to bottom, so in this case it can be 1, 2, 3 or 4.
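
Another common way to build the same grid is `plt.subplots` (note the extra `s`), which creates the figure and all the subplots in one call and returns them as an array. This is a sketch of that alternative style, not a replacement for the cells in this notebook:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.01)

# axes is a 2x2 array; axes[row, column] is one subplot
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

axes[0, 0].plot(x, x)                    # same position as plt.subplot(2,2,1)
axes[0, 0].set_title('Linear')

axes[0, 1].plot(x, x**2)                 # same position as plt.subplot(2,2,2)
axes[0, 1].set_title('Quadratic')

axes[1, 0].plot(x, x**3)                 # same position as plt.subplot(2,2,3)
axes[1, 0].set_title('Cubic')

axes[1, 1].plot(x[1:], np.log(x[1:]))    # skip x = 0, where log is -inf
axes[1, 1].set_title('Logarithmic')
```

Here the position is given as a row and a column (both starting at 0) instead of a single number starting at 1.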

So to display 2 subplots side by side we will do:

In [None]:
plt.subplot(1,2,1) # 1 row , 2 columns, we are drawing subplot = 1 
plt.plot(x,x)
plt.title('Linear')

plt.subplot(1,2,2) # 1 row , 2 columns, we are drawing subplot = 2
plt.plot(x,sq)
plt.title('Quadratic')

plt.show()

Similarly to display 2 subplots on top of one another we will do:

In [None]:
plt.figure(figsize=(5,10))

plt.subplot(2,1,1) # 2 rows , 1 column, we are drawing subplot = 1
plt.plot(x,cube,'g--') # green and made of ---
plt.title('Cubic')

plt.subplot(2,1,2) # 2 rows , 1 column, we are drawing subplot = 2
plt.plot(x,log,'r*') # red and made of *
plt.title('Logarithmic')

plt.show()

## Images

You can also read and display images using `matplotlib`. Let's try reading a test image stored in the `assets/Images` directory (in programming we refer to folders as directories... well, it sounds cooler).

In [None]:
plt.imread('assets/Images/test.jpg')

That doesn't look like an image! Well, it actually is one. To be exact, these are the pixel values that make up the image, given to us in the form of an `ndarray`.

We can display this image by using `plt.imshow`

In [None]:
test_image = plt.imread('assets/Images/test.jpg')
plt.imshow(test_image)
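
Since an image is just an `ndarray`, you can also build one yourself. The sketch below makes a small image by hand; the size and colors are made up for illustration, and I am assuming the usual layout for a color image of (height, width, 3) with values from 0 to 255.

```python
import numpy as np

height, width = 100, 150

# Start with an all-black image: 3 channels (red, green, blue) per pixel
image = np.zeros((height, width, 3), dtype=np.uint8)

# Turn the left half pure red by setting channel 0 to its maximum
image[:, : width // 2, 0] = 255

print(image.shape, image.dtype)
print(image[0, 0], image[0, -1])  # a red pixel and a black pixel
```

Passing `image` to `plt.imshow` should show a red rectangle next to a black one. You can check `test_image.shape` and `test_image.dtype` in the same way to see how the test image is stored.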

You can also save images using `imsave`:

In [None]:
plt.imsave(fname="my_image.jpg",arr=test_image) #this will save the above image as "my_image.jpg"  

Note: the above line will save the image right next to where you have this `jupyter notebook` saved. We call this place the `working directory` in programming.
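
If you are not sure where the working directory is, Python can tell you. A minimal sketch using the standard `os` module (the variable names are just for illustration):

```python
import os

# The directory that relative paths like "my_image.jpg" are resolved against
working_directory = os.getcwd()

# Full path of the file that plt.imsave would have written
saved_path = os.path.join(working_directory, "my_image.jpg")
print(saved_path)
```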

<hr style="margin: auto;
           height: 40px;
           background: linear-gradient(135deg, #ECEDDC 25%, transparent 25%) -20px 0, linear-gradient(225deg, #ECEDDC 25%, transparent 25%) -20px 0, linear-gradient(315deg, #ECEDDC 25%, transparent 25%), linear-gradient(45deg, #ECEDDC 25%, transparent 25%);
           background-size: 40px 40px;
           background-color: #EC173A;"></hr>

**Congratulations! You have completed the first notebook of Drishti's workshop on Machine Learning/ Deep Learning /Computer vision. I hope you learned a lot. Although I have said this twice already, I would still like to emphasize that this is just the tip of the iceberg: each of these 3 libraries is incredibly powerful and filled with features. The only way you can really learn more about them is to use them in your projects. Our aim was just to get you informed and excited about what you can do with these libraries.**

# Resources

The following is a list of resources that you can follow to learn more about these libraries:

### NumPy

- [Official documentation](https://numpy.org/doc/1.18/user/quickstart.html)
- [NumPy Tutorials by Amulya's Academy on youtube](https://www.youtube.com/playlist?list=PLzgPDYo_3xukqLLjNeuCxj4CwvkJin03Z)
- [NumPy Cheat Sheet — Python for Data Science](https://www.dataquest.io/blog/numpy-cheat-sheet/)

### Pandas

- [Official documentation](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented)
- [Pandas tutorials by Corey Schafer on youtube](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS)
- [Data analysis in Python with pandas by Data School on youtube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y) -  **recommended**
- The Pandas DataFrame – loading, editing, and viewing data in Python

### Matplotlib

- [Official documentation](https://matplotlib.org/2.1.1/tutorials/index.html)
- [Sentdex's tutorial on Matplotlib on youtube: Matplotlib Tutorial Series - Graphing in Python](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfefDfXb9Yf0la1fPDKluPF)
- [Sentdex's website: Introduction to Matplotlib and basic line](https://pythonprogramming.net/matplotlib-intro-tutorial/) - you can get the code he uses in his videos here.
- [Matplotlib - Simple Plot by tutorials point](https://www.tutorialspoint.com/matplotlib/matplotlib_simple_plot.htm)

Have a nice day!