# Welcome to the Introduction to ML Course

This class introduces data science and machine learning. No prerequisites are assumed, but some programming experience will help.

## Programming Environment

We will use Python as the programming language. Python can be run in several ways: locally on your computer, on a company server, or on a third-party cloud server. We will use [Google Colab](https://colab.research.google.com/), a free tool provided by Google in which code is written in notebooks. Colab can also be run with a GPU enabled, which can make some workloads, such as training neural networks, run faster.

## First Steps:

The first step is to create a Google account. You probably already have one if you use a service such as Gmail or Google Drive.

The second step is to log in to [Google Colab](https://colab.research.google.com/) so that you can create a notebook for yourself.

The third step is to either create a new notebook or clone this notebook. Instructions will be provided in class.

# Getting Started

Before proceeding further, let's check which version of Python is installed. To do so, we first import the **sys** library.

In [None]:
import sys
print(sys.version)

3.8.16 (default, Dec  7 2022, 01:12:13) 
[GCC 7.5.0]


# Features of Python Language
There are various resources for learning Python. Some of them are in the link below:
https://wiki.python.org/moin/BeginnersGuide/Programmers


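Python is a dynamically typed, interpreted language. (It is still strongly typed: it will not silently mix a string and a number in one operation.)

Static vs Dynamic Typing


1. Statically Typed Language - the data type of each variable is declared up front

int x

string y

e.g. C++, Java, C#, etc

2. Dynamically Typed Language - we don't have to pre-specify the data type

x = "Hello"

x = 123

e.g. JavaScript, Python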
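
In Python's case, a name can be rebound to a value of a different type at run time, yet combining incompatible types in one operation still raises an error. The following is a minimal sketch of both behaviours; the variable names are illustrative only.

```python
# A name can be rebound to values of different types at run time.
value = "Hello"
first_type = type(value).__name__    # type of the string value
value = 123
second_type = type(value).__name__   # type of the integer value

# Python does not implicitly convert between str and int.
try:
    result = "1" + 1
except TypeError as err:
    result = None
    error_name = type(err).__name__

print(first_type, second_type, error_name)
```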

-----

Compiled vs Interpreted

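1. Compiled: 

writing code -> compiling -> producing an executable (C++) or bytecode that a runtime executes (Java, C#)

e.g. C++, Java, C#

2. Interpreted:

no separate compilation step; you write code and the interpreter runs it on the fly, from top to bottom.

e.g. Python, JavaScript


There is no explicit memory management in Python or JavaScript; unused memory is reclaimed automatically.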


## Variables

Variables are a way to store data. They are dynamically typed, and the inferred data type can be viewed with the ***type*** function.

Let us look at various examples.

In [None]:
name = "John Smith"

In [None]:
type(name)

str

In [None]:
x = 123

In [None]:
type(x)

int

In [None]:
price = 30.008

In [None]:
type(price)

float

## Complex Data Types

A *list* is an example of a complex data type that can store more than one value.

Some examples are below:

In [None]:
L = [1, 2, 3]

In [None]:
type(L)

list

In [None]:
L = [1, "John", 3.14]

Lists have an index for each position starting at index 0. We can look up elements using their index in the list.

In [None]:
L[0]  # starts at 0 index

1

In [None]:
L[1]

'John'

In [None]:
L[-1]

3.14

In [None]:
L[0:2] # starting at index 0 and ending before index 2

[1, 'John']

In [None]:
L[:2] # starting at beginning and ending before index 2

[1, 'John']

In [None]:
L[1:] # starting at element with index 1 and going to the end

['John', 3.14]

In [None]:
L.reverse() # reverse L in-place 

In [None]:
L

[3.14, 'John', 1]

A dictionary is another complex data type, which allows us to associate a *key* with each element value.

In [None]:
numbers = {'one':1, 'two':2, 'three':3} #dictionary is a key-value structure

In [None]:
type(numbers)

dict

We can access elements of a dictionary by using their keys.

In [None]:
numbers['one']

1

In [None]:
my_profile = {'name': "Sue", 'Company': "Vistra", 'Position': "Manager"}

In [None]:
my_profile['Position']

'Manager'

In [None]:
my_profile['Company']

'Vistra'

In [None]:
len(my_profile)

3
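
Dictionaries can also be changed after they are created. The sketch below shows a few common operations; the `profile` variable and the value "Vistra Corp" are made up for illustration.

```python
profile = {'name': "Sue", 'Company': "Vistra"}

profile['Position'] = "Manager"       # add a new key-value pair
profile['Company'] = "Vistra Corp"    # overwrite the value of an existing key

has_name = 'name' in profile          # membership test checks the keys
missing = profile.get('Age', 'n/a')   # get() returns a default instead of raising KeyError

# items() yields (key, value) pairs
for key, value in profile.items():
    print(key, "->", value)
```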

## If Statements And Looping

If statements allow logical processing based on various conditions.

In [None]:
x = 10

In [None]:
if (x < 10):
  print("Less")
elif (x == 10):
  print("equal")
else:
  print("greater")


# note automatic indentation

equal


A for loop iterates over a list or any other iterable object.

In [None]:
mylist = ['this', 'is', 'a', 'list']

for word in mylist:
  print(word)

this
is
a
list
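
Two other iterables that come up often are `range`, which produces a sequence of integers, and `enumerate`, which pairs each element with its index. A small sketch, with variable names chosen only for illustration:

```python
words = ['this', 'is', 'a', 'list']

# range(5) yields 0, 1, 2, 3, 4
squares = []
for i in range(5):
    squares.append(i * i)

# enumerate yields (index, element) pairs
labelled = []
for index, word in enumerate(words):
    labelled.append(str(index) + ":" + word)

print(squares)
print(labelled)
```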


## Task 1: 

1. Create a list having 5 elements: 10, 20, 30, 40, 50
2. Compute their mean. Note that lists have no built-in mean function, so you will have to use the *sum()* and *len()* functions
3. Loop over the list, and if any element is less than the mean add 10 to it and if it is more than the mean subtract 5 from it. 
4. Print the final list
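
As a hint, the sketch below shows the two building blocks on a different list and with a different update rule: computing a mean from `sum()` and `len()`, and changing list elements in place by looping over index positions. The list contents and the doubling rule are assumptions for illustration, not the task's answer.

```python
values = [2, 4, 6, 8]
mean_value = sum(values) / len(values)   # 20 / 4

# Loop over positions so that elements can be reassigned in place.
for i in range(len(values)):
    if values[i] > mean_value:
        values[i] = values[i] * 2

print(mean_value, values)
```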

## Functions

A function is a reusable, named block of code.

In [None]:
def my_function():
  print("Hello from a function")

In [None]:
my_function()

Hello from a function


In [None]:
def sayHello(name):
  print("Hello " + name + "!")

In [None]:
sayHello("Bill")

Hello Bill!


In [None]:
sayHello("Mary")

Hello Mary!
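
Functions can also hand a value back to the caller with `return`, and parameters can have default values. A minimal sketch; `make_greeting` and its `punctuation` parameter are hypothetical names introduced here.

```python
def make_greeting(name, punctuation="!"):
    # Build and return the string instead of printing it.
    return "Hello " + name + punctuation

greeting = make_greeting("Bill")        # uses the default punctuation
question = make_greeting("Mary", "?")   # overrides the default

print(greeting, question)
```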


## Task 2: 
1. Create a list of numbers from 1 to 100.
2. Write a function to check if a given number is prime.
3. Loop through the list and if a number is prime, print it to the screen.

Solution is below. Make sure you understand each part.

In [None]:
nums = list(range(1, 101))

In [None]:
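def isPrime(num):
  if num < 2:  # 0, 1 and negative numbers are not prime
    return False
  for i in range(2, num // 2 + 1):  # try every candidate divisor up to num/2
    if num % i == 0:
      return False

  return True

In [None]:
isPrime(7003)

False

In [None]:
for i in nums:
  if isPrime(i):
    print(i)
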
2
3
5
7
11
13
17
19
23
29
31
37
41
43
47
53
59
61
67
71
73
79
83
89
97
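
The check can be made faster by noting that any factor larger than the square root of a number must pair with a factor smaller than it, so testing divisors up to the square root is enough. The sketch below is one possible variant, not the course's solution; `is_prime_fast` is a hypothetical name, and it treats numbers below 2 as non-prime.

```python
import math

def is_prime_fast(num):
    if num < 2:
        return False
    # math.isqrt returns the integer square root (available from Python 3.8).
    for i in range(2, math.isqrt(num) + 1):
        if num % i == 0:
            return False
    return True

primes = []
for n in range(1, 101):
    if is_prime_fast(n):
        primes.append(n)

print(len(primes))
```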


## Numpy Library For Numerical Data Processing

We will be working with numerical data most of the time, so the NumPy library will be very useful. More details are available at https://numpy.org/

In [None]:
# numpy is already installed, load it to your environment
import numpy as np  # np is an alias for numpy

NumPy provides a data structure called *array*, which is much more efficient than a list for numerical work. It can be initialized as:

In [None]:
# array
arr = np.array([1, 2, 3])

In [None]:
arr

array([1, 2, 3])

In [None]:
type(arr)

numpy.ndarray

In [None]:
arr.shape

(3,)
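
Much of the convenience of arrays comes from arithmetic that applies to every element at once, without writing a loop. A short sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([1, 2, 3])

doubled = arr * 2       # multiplies every element by 2
shifted = arr + 10      # adds 10 to every element
total = arr.sum()       # sum of all elements
average = arr.mean()    # arithmetic mean of all elements

print(doubled, shifted, total, average)
```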

Arrays can be multi-dimensional. 

In [None]:
twoDim = np.array([[1,2],[3,4],[5,6],[7,8]])


In [None]:
print(twoDim) # 2-D array, matrix

[[1 2]
 [3 4]
 [5 6]
 [7 8]]


In [None]:
twoDim.shape

(4, 2)

In [None]:
twoDim.T  # transpose of the matrix

array([[1, 3, 5, 7],
       [2, 4, 6, 8]])

In [None]:
twoDim.size

8

How to access elements of the array?

In [None]:
twoDim[0,] # 0th row

array([1, 2])

In [None]:
twoDim[:,0] # 0th column

array([1, 3, 5, 7])

In [None]:
twoDim[2, 1]  # 2nd row and 1st column

6

*np.arange* generates numbers from a starting value up to, but not including, an ending value. The optional third parameter sets the step size.

In [None]:
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

*reshape* allows us to convert a set of numbers into a matrix.

In [None]:
np.arange(0, 15).reshape(3, 5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
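
The new shape must hold exactly the same number of elements as the original. One dimension can be given as -1, in which case NumPy works it out from the others. A small sketch, with illustrative variable names:

```python
import numpy as np

flat = np.arange(0, 12)       # 12 numbers: 0 through 11
grid = flat.reshape(3, -1)    # -1 lets NumPy infer the number of columns
back = grid.reshape(-1)       # flatten back to one dimension

print(grid.shape, back.shape)
```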

Generating random integers (the upper bound, 7 here, is excluded, so the call below simulates 10 rolls of a six-sided die)

In [None]:
np.random.randint(1, 7, 10)

array([1, 6, 3, 6, 5, 6, 5, 2, 1, 2])

## Task 3:
Let us play a game. You roll two dice together. If the sum is 7, you win. Otherwise, you lose. Play this game 10 times and report how many times you won and how many times you lost. Create a function to check if the outcome is a winner or loser. 

In [None]:
dice1 = np.random.randint(1, 7, 10)
dice2 = np.random.randint(1, 7, 10)
result = dice1 + dice2
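
One way to finish the game from here is sketched below. It assumes a helper named `is_winner`, which is a hypothetical name, and counts wins with a plain loop. Because the rolls are random, the counts differ from run to run.

```python
import numpy as np

def is_winner(total):
    # A roll wins when the two dice add up to 7.
    return total == 7

dice1 = np.random.randint(1, 7, 10)   # 10 rolls of the first die
dice2 = np.random.randint(1, 7, 10)   # 10 rolls of the second die
result = dice1 + dice2

wins = 0
for total in result:
    if is_winner(total):
        wins += 1
losses = len(result) - wins

print("won:", wins, "lost:", losses)
```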

Sampling from a normal distribution with mean *mu* and standard deviation *sigma*

In [None]:
mu, sigma = 3.0, 0.4

In [None]:
s = np.random.normal(mu, sigma, 10)

In [None]:
s

array([2.8879059 , 3.51432056, 2.85494832, 3.36307895, 2.99545094,
       2.99704839, 2.79751613, 3.0048883 , 3.25002385, 2.6689145 ])
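
With a larger sample, the sample mean and standard deviation should land close to *mu* and *sigma*. The sketch below checks this; the fixed seed and the sample size of 100000 are assumptions made so the run is repeatable.

```python
import numpy as np

np.random.seed(0)   # fixed seed, chosen arbitrarily for repeatability
mu, sigma = 3.0, 0.4

sample = np.random.normal(mu, sigma, 100000)
sample_mean = sample.mean()
sample_std = sample.std()

print(sample_mean, sample_std)
```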

In [None]:
A = np.array( [[1,1],
               [0,1]] )
B = np.array( [[2,0],
               [3,4]] )

In [None]:
A * B   # element-wise multiplication

array([[2, 0],
       [0, 4]])

In [None]:
np.dot(A, B) # matrix multiplication

array([[5, 4],
       [3, 4]])
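
Python also provides the `@` operator for matrix multiplication, which gives the same result as `np.dot` for 2-D arrays. The sketch below also shows that the order of the operands matters.

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])

product = A @ B                                 # matrix multiplication
same = np.array_equal(product, np.dot(A, B))    # compare with np.dot
reverse = B @ A                                 # generally differs from A @ B

print(product)
print(reverse)
```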

## Pandas Library

In data science, most of the time we will be working with structured, tabular datasets. Some examples include CSV files, Excel files, and tab-delimited files.

There is a very powerful library called *Pandas* that can help us with loading, parsing, and examining data. Since it's already installed in Google Colab, the first step is simply to load it.

In [None]:
import pandas as pd # pd is an alias for pandas

In [None]:
?pd

Pandas has a set of useful commands to read in data. Look at the pandas documentation for inputting data:
https://pandas.pydata.org/docs/user_guide/io.html

Let's read in a csv file containing data about Titanic passengers.

In [None]:
titanic = pd.read_csv("https://an-vistra.s3.us-west-1.amazonaws.com/data/titanic.csv") # read_csv expects to see a csv file. 

In [None]:
# print out the data type of titanic variable
type(titanic)

pandas.core.frame.DataFrame

In [None]:
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,,,


In [None]:
# it has understood that the data has a header row, and it has automatically parsed the columns 
titanic.shape

(1309, 14)

In [None]:
titanic.describe()  # give me summary statistics for numerical columns

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881138,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.413493,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.17,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [None]:
titanic.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [None]:
titanic[['sex', 'pclass']]

Unnamed: 0,sex,pclass
0,female,1
1,male,1
2,female,1
3,male,1
4,female,1
...,...,...
1304,female,3
1305,female,3
1306,male,3
1307,male,3
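
The kind of brackets used affects what comes back: a single column name returns a Series, while a list of names returns a DataFrame. The sketch below uses a small made-up table (the `people` data is hypothetical, not taken from the Titanic file) so that it runs without a download.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'age': [34, 51, 27],
    'city': ['Austin', 'Dallas', 'Austin'],
})

one_column = people['age']               # single name -> Series
two_columns = people[['name', 'age']]    # list of names -> DataFrame
first_rows = people.head(2)              # first two rows

print(type(one_column).__name__, type(two_columns).__name__)
```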


## Filtering Dataframes

We can apply a filter condition, which yields a Boolean for each row, and then keep only the rows where the condition is true.
For example, *titanic['age'] > 50* returns True or False for each row. To see the rows where the condition is true, use it as an index: *titanic[titanic['age'] > 50]*

In [None]:
titanic[titanic['age'] > 50]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
14,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S,B,,"Hessle, Yorks"
33,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S,8,,"Birkdale, England Cleveland, Ohio"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1068,3,0,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S,,,
1225,3,0,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,,261.0,
1235,3,0,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S,,,
1261,3,1,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S,15,,


In [None]:
titanic[titanic['survived'] == 1]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.00,0,0,19952,26.5500,E12,S,3,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.00,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.00,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1261,3,1,"Turkula, Mrs. (Hedwig)",female,63.00,0,0,4134,9.5875,,S,15,,
1277,3,1,"Vartanian, Mr. David",male,22.00,0,0,2658,7.2250,,C,13 15,,
1286,3,1,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38.00,0,0,2688,7.2292,,C,C,,
1290,3,1,"Wilkes, Mrs. James (Ellen Needs)",female,47.00,1,0,363272,7.0000,,S,,,


To count how many times each distinct value occurs in a column:

In [None]:
titanic["survived"].value_counts()

0    809
1    500
Name: survived, dtype: int64

## Task 4

Run the following queries:

1. Filter the data to those rows where *pclass* is equal to 1
2. Filter the data to those rows where *pclass* is equal to 1 AND *survived* is equal to 1. Hint: use the *&* operator to join the two conditions, and wrap each condition in parentheses
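
In pandas, row-wise conditions are combined with `&` (and) and `|` (or), with each condition wrapped in parentheses; the plain `and` keyword raises an error when applied to a whole column. The sketch below shows the pattern on a small made-up table, so the `people` data and its columns are hypothetical and it is not the task's answer.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'age': [34, 51, 27, 62],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
})

# Both conditions must hold.
older_in_dallas = people[(people['age'] > 50) & (people['city'] == 'Dallas')]

# At least one condition must hold.
young_or_austin = people[(people['age'] < 30) | (people['city'] == 'Austin')]

print(older_in_dallas)
print(young_or_austin)
```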

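How to apply a function to each value in a column of a dataframe. Note that a missing age (NaN) fails both comparisons below, so it falls through to the 'Old' label, as row 1305 in the output shows.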

In [None]:
def createAgeRange(age):
  if (age < 30.0):
    ageRange = 'Young'
  elif ((age >= 30.0) and (age < 50.0)):
    ageRange = 'Middle-Aged'
  else:
    ageRange = 'Old' 

  return ageRange

In [None]:
ageRange = titanic['age'].apply(lambda x: createAgeRange(x))

In [None]:
ageRange

0             Young
1             Young
2             Young
3       Middle-Aged
4             Young
           ...     
1304          Young
1305            Old
1306          Young
1307          Young
1308          Young
Name: age, Length: 1309, dtype: object

In [None]:
titanic['ageRange'] = ageRange

In [None]:
titanic['ageRange'].value_counts()

Young          569
Old            373
Middle-Aged    367
Name: ageRange, dtype: int64
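
An alternative to a hand-written function is `pd.cut`, which assigns values to labelled bins and leaves missing values as missing. The sketch below is an assumption-laden illustration on a made-up Series, not the method used above; the bin edges mirror the age ranges in `createAgeRange`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([22.0, 35.0, 61.0, np.nan])

age_range = pd.cut(
    ages,
    bins=[0, 30, 50, np.inf],
    right=False,   # intervals are [0, 30), [30, 50), [50, inf)
    labels=['Young', 'Middle-Aged', 'Old'],
)
labels = age_range.tolist()

print(labels)
```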

## Grouping and Aggregation

One of the important data science tasks is to group by certain columns (also called attributes) and then find aggregate statistics of the remaining columns for each group of the grouping column.

In [None]:
titanic.groupby("pclass").size()

pclass
1    323
2    277
3    709
dtype: int64

In [None]:
titanic.groupby(['pclass', 'sex']).count()

pclass,sex,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,ageRange
1,female,144,144,133,144,144,144,144,121,142,138,0,121,144
1,male,179,179,151,179,179,179,179,135,179,63,35,168,179
2,female,106,106,103,106,106,106,106,13,106,86,1,104,106
2,male,171,171,158,171,171,171,171,10,171,26,30,157,171
3,female,216,216,152,216,216,216,216,7,216,95,7,63,216
3,male,493,493,349,493,493,493,492,9,493,78,48,132,493


In [None]:
titanic.groupby(['pclass'])["fare"].mean()

pclass
1    87.508992
2    21.179196
3    13.302889
Name: fare, dtype: float64

In [None]:
titanic.groupby(['pclass'])["fare"].agg(["mean", "std"])

pclass,mean,std
1,87.508992,80.447178
2,21.179196,13.607122
3,13.302889,11.494358
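
When a column holds only 0 and 1, its mean within each group is the fraction of rows equal to 1. The sketch below shows this together with a multi-statistic aggregation on a small made-up table; the `orders` data and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
orders = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'West'],
    'paid':   [1, 0, 1, 1, 0],
    'amount': [10.0, 20.0, 5.0, 15.0, 10.0],
})

# Mean of a 0/1 column = share of rows equal to 1 in each group.
paid_rate = orders.groupby('region')['paid'].mean()

# Several statistics of one column at once.
summary = orders.groupby('region')['amount'].agg(['mean', 'max'])

print(paid_rate)
print(summary)
```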


## Task 5:

Using the titanic dataset, run the following queries:

1. Group by the ageRange column and find the *count* of survivors of each group. 
2. Group by the ageRange column and find the *percent* of survivors of each group.

## Task 6: Working with a Different Dataset

In this task, you will first read in a cars dataset from https://an-utd-python.s3.us-west-1.amazonaws.com/Car_sales.csv

Then run the following queries on the cars dataset:

1. Give a breakdown of count of models grouped by manufacturer. Sort your answer in decreasing order of count of models
2. Find the most expensive car for each manufacturer
3. Find average fuel_efficiency for each vehicle_type
4. Using the columns *Price_in_thousands* and *Sales_in_thousands*, create a new column called *Total_Revenue_in_thousands*
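
The building blocks for these queries are sketched below on a small made-up table: deriving a new column from two existing ones, counting rows per group in sorted order, and taking a per-group maximum. The `items` data and its column names are hypothetical and do not come from the cars file.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
items = pd.DataFrame({
    'maker': ['A', 'A', 'B'],
    'unit_price': [2.0, 3.0, 5.0],
    'units': [10, 4, 3],
})

# New column computed element-wise from two existing columns.
items['revenue'] = items['unit_price'] * items['units']

# Rows per group, largest group first.
by_maker = items.groupby('maker').size().sort_values(ascending=False)

# Highest unit price within each group.
top_price = items.groupby('maker')['unit_price'].max()

print(items)
print(by_maker)
print(top_price)
```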