<h1 align='center'> COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security</h1>

<h2 align='center'> Lab 02 - Data Analysis with NumPy & Pandas</h2>

*****

In Lab 01, we studied the basics of data structures in Python. Viewing the data as tables, rather than as lists and dictionaries, would clearly have been a better experience for most of you. The **NumPy** and **Pandas** modules provide that ease of visualization and computation, especially when working with large amounts of data.

**Pandas** is one of the most widely used Python modules for data science. It provides high-performance, easy-to-use data structures and data analysis tools. Among other useful data structures, Pandas provides an in-memory 2D table object called a **DataFrame**. It works like a spreadsheet with column names and row labels. With DataFrames, Pandas offers many additional features such as creating pivot tables, computing columns based on other columns and plotting graphs.

**NumPy** (short for ‘Numerical Python’) is a Python module that provides fast mathematical computation on arrays and matrices. It provides the essential multi-dimensional, array-oriented computing functionality needed for high-level mathematical functions and scientific computation. Since arrays and matrices are an essential part of machine learning, NumPy, together with modules such as Scikit-learn, Pandas, Matplotlib and TensorFlow, forms the core of the Python machine learning ecosystem.

For those of you who still aren't impressed by the computational power and speed these two modules offer, let's do a little test to observe the difference in computation time between traditional Python and NumPy. Let's call this test **'Why NumPy?'**.

In [2]:
# First things first! 
# We can't proceed without importing the necessary modules
import numpy as np
import pandas as pd
# Notice how we have used aliases for both these modules 
# so we won't have to type in their full names repeatedly 
import time
import math

In [3]:
# Now, let's compare the time taken by traditional Python vs. NumPy
# to add the elements of two sequences with 10 million items each

def trad_version():
    t1 = time.time()
    X = range(10000000)
    Y = range(10000000)
    Z = [x+y for x,y in zip(X,Y)]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(10000000)
    Y = np.arange(10000000)
    Z = X + Y
    return time.time() - t1

traditional_time = trad_version()
numpy_time = numpy_version()

print("Traditional Time = " + "%.4f" %(traditional_time) + " s.")
print("NumPy Time       = " + "%.4f" %(numpy_time) + " s.")
print("We can see that NumPy is approximately %d times faster than Traditional Python." 
      %(round(traditional_time / numpy_time)))

Traditional Time = 0.5582 s.
NumPy Time       = 0.0647 s.
We can see that NumPy is approximately 9 times faster than Traditional Python.


Besides this, there are some other advantages like - 

1. For machine learning operations, it is much easier to use NumPy’s **ndarray** (which provides a lot of convenient and optimized implementations of essential mathematical operations on vectors) than traditional **lists**.

2. NumPy arrays are capable of performing all basic operations such as addition, subtraction, element-wise product, **matrix dot product**, element-wise division and conditional operations.

3. Pandas **DataFrame** and **Series** objects are simple and intuitive to access, offering a variety of functionalities like creating **pivot tables**, grouping similar data, computing columns based on other columns and **plotting graphs**.
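
As a quick illustration of the second point, here is a minimal sketch of element-wise arithmetic, the matrix product and conditional selection on two small arrays. The arrays `a` and `b` are made up for this example.

```python
import numpy as np

# Two small made-up 2x2 arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise_sum = a + b        # adds matching entries
elementwise_product = a * b    # multiplies matching entries (not a matrix product)
matrix_product = a @ b         # matrix product, same result as np.dot(a, b)
large_only = b[b > 15]         # conditional selection: entries of b greater than 15

print(elementwise_sum)
print(elementwise_product)
print(matrix_product)
print(large_only)
```

Note that `*` is element-wise for NumPy arrays; use `@` or `np.dot` when you want the matrix product.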


Now that we have seen the advantages of NumPy and Pandas over traditional Python, let's introduce you to the contents of these modules. Many of the functions and data structures discussed here will be used repeatedly in the coming labs and assignments. So, **keep your eyes peeled**!!
<br/><br/>


<h3 align='center'>Pandas</h3>

**Pandas** has two main data structures that we can use for manipulating different types of data. We will look at both:

* **Series** (1-D data structures) - It is a 1-dimensional labeled array capable of holding any datatype (integers, strings, floating point numbers etc.). The axis labels are collectively referred to as the **index**.


* **DataFrames** (2-D data structures) - It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a **spreadsheet** or **SQL table**, or a **dictionary of Series objects**. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input datatypes.

### Series

In [9]:
# Making a series from a list (with indexes / row names)
# If you don't specify index, default index is 0, 1, 2, 3 ...

s = pd.Series([1,2,3,4,5],index=['a', 'b', 'c', 'd', 'e'] )
#s = pd.Series([1,2,3,4,5])
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [5]:
# Series from dictionary, where the keys are used as index, and the values as series values 
dict_s = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(dict_s)

a    0.0
b    1.0
c    2.0
dtype: float64

In [7]:
# Operations on series work more or less the same way as lists
# 1st element of the series (iloc selects by position, since s has string labels)
print("1st element of series:", s.iloc[0], "\n")

1st element of series: 1 



In [13]:
# Element with index 'd'
print ("Element with index 'd':", s['d'], "\n")

print("Element with index 'c':", dict_s['c'])

Element with index 'd': 4 

Element with index 'c': 2.0


In [17]:
# Slicing the given series
print(s[3:5], "\n")
print(s[0:3])


d    4
e    5
dtype: int64 

a    1
b    2
c    3
dtype: int64


In [21]:
# Comparing each element with the median and the mean of the series
print("Median:", s.median())
for item in s:
    print(item, "is at least the median" if item >= s.median() else "is smaller than the median")
    
print("Mean:", s.mean())
for item in s:
    print(item, "is at least the mean" if item >= s.mean() else "is smaller than the mean")

Median: 3.0
1 is smaller than the median
2 is smaller than the median
3 is at least the median
4 is at least the median
5 is at least the median
Mean: 3.0
1 is smaller than the mean
2 is smaller than the mean
3 is at least the mean
4 is at least the mean
5 is at least the mean


In [26]:
# Print the elements that are greater than the median of the series,
# then the 3rd element (by position), then the elements smaller than the mean
print(s[s > s.median()])
print("The 3rd Element is " + str(s.iloc[2]))
print(s[s < s.mean()])

d    4
e    5
dtype: int64
The 3rd Element is 3
a    1
b    2
dtype: int64


### DataFrame

In [28]:
# Making a dataframe from a dictionary of series
dict_s = {'one' : pd.Series([1, 'Harry', 8.7], index=['a', 'b', 'd']),
          'two' : pd.Series([9, 'Matt', 7.5, 6.1], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(dict_s)
df

dict_new = {'one':pd.Series([1, 'Mia', 9, 9], index=['uid', 'Name','Grade1', 'Grade2']), 
            'two':pd.Series([2, 'Kat', 9.2], index=['uid', 'Name','Grade1'])} # NaN is displayed where a value is missing
dn = pd.DataFrame(dict_new)
print(dn)

        one  two
Grade1    9  9.2
Grade2    9  NaN
Name    Mia  Kat
uid       1    2


Passing an **index** (row labels) and **columns** (column labels) along with your data is optional, but if you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting **DataFrame**. Thus, a **dict of Series** plus a specific index will discard all data not matching up to the passed index.
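
The following minimal sketch illustrates that behaviour. The dictionary `series_dict` and its values are made up for this example; passing `index` and `columns` to `pd.DataFrame` keeps only the requested labels and fills anything missing with NaN.

```python
import pandas as pd

# Made-up dict of Series with partly different indexes
series_dict = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd']),
}

# Only rows 'd', 'b' and 'a' are kept, in that order; row 'c' is discarded
subset = pd.DataFrame(series_dict, index=['d', 'b', 'a'])
print(subset)

# Columns work the same way; a column that is not in the data is filled with NaN
with_extra = pd.DataFrame(series_dict, index=['d', 'b', 'a'], columns=['two', 'three'])
print(with_extra)
```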

You can also load a CSV file into a DataFrame with `pd.read_csv` (Excel worksheets can be loaded with `pd.read_excel`):

In [34]:
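df_cars = pd.read_csv('cars.csv')
# head() displays the first 5 rows of the resultant dataframe
df_cars.head()


dn_irisTest = pd.read_csv('iris_test.csv')
dn_irisTest.head()
dn_irisTest.columns
dn_irisTest.values
# Note: info is a method. Without parentheses, as on the next line, the notebook
# only shows the bound method's repr (below); call dn_irisTest.info() for the summary.
dn_irisTest.info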



<bound method DataFrame.info of     sepal_length  sepal_width  pedal_length  pedal_width            class
0            5.4          3.9           1.7          0.4      Iris-setosa
1            5.4          3.9           1.3          0.4      Iris-setosa
2            5.0          3.6           1.4          0.2      Iris-setosa
3            4.6          3.4           1.4          0.3      Iris-setosa
4            5.0          3.4           1.6          0.4      Iris-setosa
5            6.0          2.2           5.0          1.5   Iris-virginica
6            7.2          3.6           6.1          2.5   Iris-virginica
7            5.8          2.7           5.1          1.9   Iris-virginica
8            6.0          3.0           4.8          1.8   Iris-virginica
9            6.4          2.8           5.6          2.2   Iris-virginica
10           5.9          3.0           4.2          1.5  Iris-versicolor
11           6.4          2.9           4.3          1.3  Iris-versicolor
12    

Some other commonly used attributes and methods are:

* **df.columns** - returns the column labels (as an Index object)
* **df.values** - returns all values, without index and column names, as a NumPy array
* **df.info()** - prints a summary of the data, including the datatype of every column
* **df.head()** - returns the first 5 rows (by default) of the dataframe
* **df.tail()** - returns the last 5 rows (by default) of the dataframe
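
To try these out without a CSV file, here is a small sketch on a made-up DataFrame. The name `demo` and its contents are illustrative only.

```python
import pandas as pd

# Small made-up DataFrame with 7 rows
demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'score': [1, 2, 3, 4, 5, 6, 7],
})

print(demo.columns)   # Index object holding the column labels
print(demo.values)    # 2-D NumPy array of the cell values
print(demo.head(3))   # first 3 rows (5 if no argument is given)
print(demo.tail(2))   # last 2 rows (5 if no argument is given)
demo.info()           # prints a summary itself and returns None
```

Remember the parentheses on `head()`, `tail()` and `info()`; `columns` and `values` are attributes and take none.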

In [36]:
# You can slice the dataframe in the following ways -

# Accessing individual columns
# df_cars['type']
# df_cars.hp
df_cars[['type','disp']].head()
df_cars[['type','disp']].tail()
df_cars[['type','disp', 'speed']].head()

Unnamed: 0,type,disp,speed
0,AMC Ambassador Brougham,360.0,11.0
1,AMC Ambassador DPL,390.0,8.5
2,AMC Ambassador SST,304.0,11.5
3,AMC Concord DL 6,232.0,18.2
4,AMC Concord DL,258.0,15.1


In [37]:
# Accessing individual row
# df_cars.loc[2] 
df_cars.iloc[1:3]
df_cars.iloc[:4]

Unnamed: 0,type,mpg,cyl,disp,hp,wt,speed
0,AMC Ambassador Brougham,13.0,8,360.0,175.0,3821,11.0
1,AMC Ambassador DPL,15.0,8,390.0,190.0,3850,8.5
2,AMC Ambassador SST,17.0,8,304.0,150.0,3672,11.5
3,AMC Concord DL 6,20.2,6,232.0,90.0,3265,18.2


In [38]:
# Accessing particular columns of selected rows
# df_cars[['type','disp']][1:3]
df_cars[1:3][['type','cyl']]
df_cars.iloc[:4]['speed']

0    11.0
1     8.5
2    11.5
3    18.2
Name: speed, dtype: float64

### Filtering Rows
We can also filter rows or select a subset of interesting rows based on the boolean comparison of two or more columns.

In [15]:
positive_pt = df_cars[df_cars.hp > df_cars.disp]
positive_pt

Unnamed: 0,type,mpg,cyl,disp,hp,wt,speed
240,Maxda RX-3,18.0,3,70.0,90.0,2124,13.5
248,Mazda RX-2 Coupe,19.0,3,70.0,97.0,2330,13.5
249,Mazda RX-4,21.5,3,80.0,110.0,2720,13.5
250,Mazda RX-7 Gs,23.7,3,70.0,100.0,2420,12.5


Now, let's see how much you have grasped from the brief rundown above. Answer the following questions based on the `df_cars` dataframe. It should already be loaded into memory if you've run the cells above, so you don't have to load it from the CSV again.

## Exercise 1

1. How would you print the last 25 rows of the dataframe?

In [40]:
# YOUR ANSWER HERE

# tail(25) returns the last 25 rows; the slice [-26:-1] would drop the final row
df_cars.tail(25)

Unnamed: 0,type,mpg,cyl,disp,hp,wt,speed
381,Volkswagen Dasher,25.0,4,90.0,71.0,2223,16.5
382,Volkswagen Dasher,26.0,4,79.0,67.0,1963,15.5
383,Volkswagen Dasher,30.5,4,97.0,78.0,2190,14.1
384,Volkswagen Jetta,33.0,4,105.0,74.0,2190,14.2
385,Volkswagen Model 111,27.0,4,97.0,60.0,1834,19.0
386,Volkswagen Pickup,44.0,4,97.0,52.0,2130,24.6
387,Volkswagen Rabbit C (Diesel),44.3,4,90.0,48.0,2085,21.7
388,Volkswagen Rabbit Custom Diesel,43.1,4,90.0,48.0,1985,21.5
389,Volkswagen Rabbit Custom,29.0,4,97.0,78.0,1940,14.5


2. How would you filter out the cars which have a mileage (mpg) greater than 40, but speed less than 20?  

In [46]:
# YOUR ANSWER HERE
mileage_mpg = df_cars[(df_cars.mpg > 40) & (df_cars.speed < 20)]
mileage_mpg


Unnamed: 0,type,mpg,cyl,disp,hp,wt,speed
117,Datsun 210,40.8,4,85.0,65.0,2110,19.2
232,Honda Civic 1500 GL,44.6,4,91.0,67.0,1850,13.8
247,Mazda GLC,46.6,4,86.0,65.0,2110,17.9
340,Renault Lecar Deluxe,40.9,4,85.0,,1835,17.3
395,Volkswagen Rabbit,41.5,4,98.0,76.0,2144,14.7


Suppose your friend Matt wants to buy a new car and asks for your help in picking the most suitable model for him from this dataset. Someone told him that mileage is a very important factor when choosing a car, so he wants you to sort the dataset by decreasing value of mpg.

3. How would you print the data for **top 15 cars** with the highest miles per gallon (mpg)?

In [59]:
# YOUR ANSWER HERE
# SAVE YOUR RESULT TO A NEW DATAFRAME FOR FURTHER COMPUTATIONS

# Sort all columns by mpg in descending order and keep the first 15 rows
df_top15 = df_cars.sort_values('mpg', ascending=False).head(15)
print(df_top15)


He further finds out that a vehicle's horsepower when divided by its weight gives us the power-to-weight ratio, which is represented as horsepower to 10 pounds. **The higher that number, the better your car is going to be in terms of performance.** 

4. How would you find out the top 10 cars (from the above result) with the highest **horsepower (hp) / weight** ratio?

In [84]:
# YOUR ANSWER HERE
# WORK ON THE RESULTANT DATAFRAME FROM PREVIOUS QUESTION 
# AND TRANSFER YOUR RESULTS TO A NEW DATAFRAME   

# Rebuild the previous result (top 15 cars by mpg) so this cell runs on its own
df_top15 = df_cars.sort_values('mpg', ascending=False).head(15)

# Add the hp / wt ratio as a new column, sort by it in descending order, keep the top 10
df_top10 = (df_top15.assign(hp_per_wt=df_top15['hp'] / df_top15['wt'])
                    .sort_values('hp_per_wt', ascending=False)
                    .head(10))
print(df_top10)


Lastly, Matt finds out that Canberra only has authorized service centers for **Honda** and **Toyota**. If he buys a car of any other make, he would have to go to Sydney for servicing and parts.

5. How would you filter only the cars produced by Honda and Toyota from the above result?

In [20]:
# YOUR ANSWER HERE
# WORK ON THE RESULTANT DATAFRAME FROM PREVIOUS QUESTION 
# AND TRANSFER YOUR RESULTS TO A NEW DATAFRAME  




If all went well, you would have provided your friend with a choice between **4 cars**, which is a pretty good shortlist, considering you started with **406 cars** in your original dataset. **GOOD JOB!**

<h3 align='center'>NumPy</h3>

**NumPy** is a core Python module for scientific computation. It provides a high-performance multidimensional array object, and tools for working with these arrays.


### Arrays

A NumPy **Array** is a grid of values, all of the same type, and is indexed by a tuple of non-negative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension. We can initialize numpy arrays from nested Python lists, and access elements using square brackets.
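
The terms above can be checked directly on a small array. This is a minimal sketch using a made-up nested list; `ndim` gives the number of dimensions (the rank) and `shape` the size along each dimension.

```python
import numpy as np

# Made-up nested list: 2 rows, 3 columns
nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

print(arr.ndim)    # number of dimensions (rank)
print(arr.shape)   # size along each dimension
print(arr.dtype)   # the single datatype shared by all elements
print(arr[1, 2])   # element in row 1, column 2
print(arr[0])      # the whole first row

# Mixing integers and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```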

In [21]:
# 1-dimensional arrays
x = np.array([2,5,18,14,4])
print ("\n Deterministic 1-dimensional array \n")
print (x)
print ("Shape:", x.shape)


 Deterministic 1-dimensional array 

[ 2  5 18 14  4]
Shape: (5,)


In [22]:
# 2-dimensional arrays
x = np.array([[2,5,18,14,4], [12,15,1,2,8]])
print ("\n Deterministic 2-dimensional array \n")
print (x)
print ("Shape:", x.shape)

# 2-dimensional randomly generated array
x = np.random.rand(5,5)
print ("\n Random 5x5 2-dimensional array \n")
print (x)
print ("Shape:", x.shape)


 Deterministic 2-dimensional array 

[[ 2  5 18 14  4]
 [12 15  1  2  8]]
Shape: (2, 5)

 Random 5x5 2-dimensional array 

[[0.09196467 0.92650548 0.85404888 0.16510818 0.44117475]
 [0.40747146 0.86648725 0.00132298 0.87715097 0.01762197]
 [0.26789454 0.21867559 0.06707643 0.79695499 0.93692706]
 [0.95890391 0.92899262 0.04979729 0.35443349 0.76728743]
 [0.23972855 0.09007526 0.5595711  0.2689249  0.6594263 ]]
Shape: (5, 5)


In [23]:
# 4x4 array of zeroes
x = np.zeros((4,4))
print ("\n 4x4 array with zeros \n")
print(x)
print ("Shape:", x.shape)


 4x4 array with zeros 

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Shape: (4, 4)


In [24]:
# 4x4 array of ones
x = np.ones((4,4))
print ("\n 4x4 array with ones \n")
print (x)
print ("Shape:", x.shape)


 4x4 array with ones 

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
Shape: (4, 4)


In [25]:
# identity matrix of size 4
x = np.eye(4)
print ("\n Identity matrix of size 4\n")
print(x)
print ("Shape:", x.shape)


 Identity matrix of size 4

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
Shape: (4, 4)


### Manipulating Arrays

In [26]:
x = np.random.rand(4,3)
print(x)
print("\n Row zero \n")
print(x[0])
print("\n Column 2 \n")
print(x[:,2])
print("\n Submatrix \n")
print(x[1:3,0:2])
print("\n Entries > 0.5 \n")
print(x[x>0.5])
print("\n Single Element\n")
print(x[1,2])

[[0.79599207 0.11665537 0.9349471 ]
 [0.39125563 0.12865951 0.21934255]
 [0.00526183 0.6899923  0.7638471 ]
 [0.76578274 0.85050239 0.56218313]]

 Row zero 

[0.79599207 0.11665537 0.9349471 ]

 Column 2 

[0.9349471  0.21934255 0.7638471  0.56218313]

 Submatrix 

[[0.39125563 0.12865951]
 [0.00526183 0.6899923 ]]

 Entries > 0.5 

[0.79599207 0.9349471  0.6899923  0.7638471  0.76578274 0.85050239
 0.56218313]

 Single Element

0.21934255272641856


### Array Arithmetic

In [27]:
x = np.random.rand(2,4)
print (x)
print('\n Mean value of all elements')
print (np.mean(x))

print('\n Standard Deviation of all elements')
print (np.std(x)) 

print('\n Median value of all elements')
print (np.median(x)) 

print('\n Sum of all elements')
print (np.sum(x)) 

print('\n Product of all elements')
print (np.prod(x)) 

print ("\n Transpose of the matrix \n")
print (x.T)
print ("Shape:", x.T.shape)

# Multiplication and addition with scalar value
print("\n Matrix 2x+1 \n") 
print(2*x + 1)

y = np.array([2,-1,3])
z = np.array([-1,2,2])
# Vector-Vector Dot Product
print('\n y:',y)
print(' z:',z)
print('\n Vector-Vector Dot Product')
print(np.dot(y,z))

[[0.75497015 0.36867949 0.28921755 0.29211605]
 [0.27687607 0.03741104 0.82174466 0.79555071]]

 Mean value of all elements
0.45457071381475445

 Standard Deviation of all elements
0.27568822050414216

 Median value of all elements
0.33039776663925335

 Sum of all elements
3.6365657105180356

 Product of all elements
0.00015923875012866003

 Transpose of the matrix 

[[0.75497015 0.27687607]
 [0.36867949 0.03741104]
 [0.28921755 0.82174466]
 [0.29211605 0.79555071]]
Shape: (4, 2)

 Matrix 2x+1 

[[2.5099403  1.73735897 1.5784351  1.58423209]
 [1.55375213 1.07482208 2.64348932 2.59110142]]

 y: [ 2 -1  3]
 z: [-1  2  2]

 Vector-Vector Dot Product
2


## Exercise 2

We have borrowed the following example from [Philipp Janert](http://shop.oreilly.com/product/9780596802363.do)’s book. In this example we will use the famous Iris Flower dataset.

We will do some Machine Learning (ML) without explicitly using any fancy ML techniques. We have two different sets of data for this dataset:

* [Train Data](./iris_train.csv), and 
* [Test Data](./iris_test.csv)

You should download both files and save them in a directory named ‘data’ in your current working directory.

Iris data has five columns, namely - 
* sepal_length 
* sepal_width 
* pedal_length 
* pedal_width 
* class

Description of each field can be found [here](https://archive.ics.uci.edu/ml/datasets/iris).

Our dumb ML algorithm is known as the **Nearest Neighbour Approach**: we predict the class label of a point in the test set to be the class label of the nearest point in the train set. Since the Nearest Neighbour algorithm has not been covered in class yet, you only need a rudimentary understanding of it to implement it. The link below will help you understand it better.

[<img src='./kNN.png'>](https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7)
Click **[k-NN(k-Nearest Neighbours Classification)](https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7)** or the image above to know more about this algorithm.
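
To make the idea concrete, here is a minimal sketch of a single nearest-neighbour prediction on made-up 2-D points. The names `train_points`, `train_classes` and `query` are illustrative and are not part of the lab files, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np

# Made-up training points (one per row) and their class labels
train_points = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
train_classes = np.array(['red', 'green', 'blue'])

# A single made-up point we want to classify
query = np.array([4.0, 4.5])

# Euclidean distance from the query to every training point;
# broadcasting subtracts the query from each row
distances = np.sqrt(((train_points - query) ** 2).sum(axis=1))

# Position of the smallest distance, and the label stored at that position
nearest = np.argmin(distances)
predicted = train_classes[nearest]

print(distances)
print(nearest, predicted)
```

For a whole test set, the same steps would be repeated for each test point.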

**HINT 1**: _For tasks 1, 2 and 3, it would be worth looking at the examples from class on how to read numpy arrays from csv files._

**HINT 2**: _Consider using np.argmin for task 4. You can also have a look at the class example to see how we got the distance of a point from all other points._
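
If you do not have the class example at hand, one possible way to read selected CSV columns into NumPy arrays is `np.genfromtxt`. The sketch below is only an illustration under stated assumptions: it reads from an in-memory string with a few sample rows in the same column layout as the Iris files, so that it runs without the real data. For the lab you would pass the path of the CSV file instead.

```python
import io
import numpy as np

# Sample rows for illustration only: header line plus 5 comma-separated columns
csv_text = """sepal_length,sepal_width,pedal_length,pedal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
6.3,3.3,6.0,2.5,Iris-virginica
5.9,3.0,4.2,1.5,Iris-versicolor
"""

# Numeric columns 0-3 as a float array; skip_header=1 skips the column names
features = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                         skip_header=1, usecols=(0, 1, 2, 3))

# Column 4 holds text, so it is read as strings
labels = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                       skip_header=1, usecols=(4,), dtype=str)

print(features.shape)
print(np.unique(labels))   # unique class labels, sorted
```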

1. Read the train set data into a numpy array (you can call it **train_features**) to get the train features. Use columns 0,1,2,3 from the csv.

In [28]:
# YOUR ANSWER HERE




2. Read the train set labels into a numpy array (you can call it **train_labels**) to get the training labels/classes. Use only column 4 for this array.

In [29]:
# YOUR ANSWER HERE

# Print unique labels




3. Similarly, read the test set data into a numpy array (you can call it **test_features**) to get the test features, using columns 0,1,2,3 from iris_test.csv. Also, read the test set labels into a numpy array (call it **test_labels**) to get the test labels/classes, using only column 4.

In [30]:
# YOUR ANSWER HERE




4. Write code for the following -
    * For each point in the test set, find the closest point in the train set.
    
    * The test instance gets the label of the closest point, save it as predicted_label.

In [31]:
# YOUR ANSWER HERE

# Going through each point in dataset

    # Calculating difference between a point in test set 
    # to all points in train set
    
    # Finding the location of the nearest point in train set
    
    

5. Now find the accuracy of our predictions by comparing each **predicted_label** with the corresponding entry in **test_labels**.
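
As a hedged illustration of what "accuracy" means here, the sketch below compares two made-up label arrays element by element. Accuracy is taken to be the fraction of positions where the prediction matches the true label. The arrays `predicted` and `actual` are invented for this example.

```python
import numpy as np

# Made-up predictions and true labels of equal length
predicted = np.array(['a', 'b', 'b', 'c'])
actual = np.array(['a', 'b', 'c', 'c'])

# Element-wise comparison gives a boolean array; True counts as 1 in the mean
matches = predicted == actual
accuracy = matches.mean()

print(matches)
print("Accuracy:", accuracy)
```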

In [32]:
# YOUR ANSWER HERE


