
# Applied Data Science in Python - University of Michigan

Start: 17/06/2023 14:20

## Week 1 - Fundamentals of Data Manipulation with Python

This course will introduce the learner to the basics of the Python programming environment, including fundamental Python programming techniques such as **lambdas, reading and manipulating CSV files, and the NumPy library**. The course will introduce data manipulation and cleaning techniques using the popular Python pandas data science library and introduce the abstraction of the Series and DataFrame as the central data structures for data analysis, along with tutorials on how to use functions such as **groupby, merge, and pivot tables** effectively. By the end of this course, students **will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses.**

## Python Functions

You can give a function optional parameters by assigning them a default value in the signature, for example z=None. Parameters with defaults must come AFTER the required parameters in the def.

In [None]:
def add_numbers(x, y, z=None): # Adds two numbers, or three if z is supplied.
  if z is None: # Compare to None with 'is' rather than '=='
    return x+y
  else:
    return x+y+z

In [None]:
add_numbers(2,4)

6

Assigning z = None in the signature makes None the default value, used whenever no third argument is passed. You can pass None explicitly or simply leave the argument out; the result is the same.
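To make the default-value behaviour concrete, here is a minimal sketch that calls the same function three ways. The variable names two, explicit_none and three are illustrative only.

```python
def add_numbers(x, y, z=None):
    # z is optional: when it is left out (or passed as None) only x and y are added
    if z is None:
        return x + y
    return x + y + z

two = add_numbers(2, 4)                   # z omitted, default None is used
explicit_none = add_numbers(2, 4, None)   # same as omitting z
three = add_numbers(2, 4, z=6)            # z supplied by keyword

print(two, explicit_none, three)
```

Passing z by keyword is optional here, but it makes the call easier to read when a function has several defaulted parameters.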

The type() function can be used to check the data type of a Python object. Tuples, lists and dictionaries are what we care about.

A tuple is an immutable sequence of values (it can't be changed after it is created). A list is a mutable sequence, and a dictionary maps keys to values.

Tuples are declared by using ( ), lists by [ ], and dictionaries by { }, for example: {key: value, key2: value}.

Keys can be any hashable (immutable) object: strings, integers, floats and tuples of these all work, but lists and dictionaries do not.
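The differences between the three containers can be checked directly. The sketch below uses made-up values; the point is that tuples reject in-place edits, and that dictionary keys must be hashable, which is why a string, an integer or a tuple works as a key but a list does not.

```python
# Hypothetical example values, chosen only for illustration.
point = (3, 4)                                     # tuple: immutable
scores = [90, 85]                                  # list: mutable
ages = {'Jeff': 21, 7: 'seven', (1, 2): 'pair'}    # keys: str, int and tuple

print(type(point), type(scores), type(ages))

scores.append(70)                # lists can be changed in place

try:
    point[0] = 10                # tuples cannot be changed in place
except TypeError as err:
    tuple_error = str(err)
    print(tuple_error)

try:
    ages[[1, 2]] = 'list key'    # a list is unhashable, so it cannot be a key
except TypeError as err:
    key_error = str(err)
    print(key_error)
```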

## Manipulating Strings
### Slicing

In Python, the slice [0:5] returns 5 values (indices 0 to 4). The value at index 5 is not included: the end of a slice is exclusive, not inclusive.

This is the same for going backwards. [-5:-3] will output the 5th last and 4th last values, but not the 3rd last as it's exclusive.

[:3] is everything before index 3, i.e. the first three values.

[3:] is everything from and including the 3rd index value (4th value as indexing begins at 0).
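The four slices described above can be tried on a short string. This is a small sketch with a made-up sample string; the variable names are illustrative.

```python
s = 'abcdefghij'        # hypothetical sample string, indices 0 to 9

first_five = s[0:5]     # indices 0-4; index 5 is excluded
from_end = s[-5:-3]     # 5th last and 4th last characters
up_to_three = s[:3]     # everything before index 3
from_three = s[3:]      # index 3 onwards

print(first_five)       # abcde
print(from_end)         # fg
print(up_to_three)      # abc
print(from_three)       # defghij
```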

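### Concatenation, Repetition and Split
(Split strings based on substrings. The str.split() method used here splits on a literal separator and is not a regular expression; RegEx is covered later in these notes.)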



In [None]:
firstname = 'Christopher'
lastname = 'Brooks'

print(firstname + ' ' + lastname)
print(firstname * 3)
print('Chris' in firstname) # The in operator can be used to search in a string, and here we're searching for 'Chris'

Christopher Brooks
ChristopherChristopherChristopher
True


In [None]:
firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0]  # [0] selects the first element of the list
lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1]  # [-1] selects the last element of the list
print(firstname)
print(lastname)

Christopher
Brooks


Make sure to convert non-string objects with str() before concatenating them to a string with +; otherwise Python raises a TypeError.
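A short sketch of what goes wrong without the conversion, and the fix. The name and balance values are made up for illustration.

```python
name = 'Chris'
balance = 12.96          # hypothetical value (a float, not a string)

try:
    message = name + ' owes ' + balance        # str + float is not allowed
except TypeError:
    message = name + ' owes ' + str(balance)   # convert first, then concatenate

print(message)           # Chris owes 12.96
```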

## Dictionaries

Dictionaries are objects that store key-value pairs. They are defined by using curly brackets { }. You can use the **.values()** method to output just the values, not the keys. To output just the keys, use **.keys()**. To output both as (key, value) pairs, use **.items()**.

You retrieve the value assigned to a key by indexing the dictionary with that key. So if ages = {'Jeff': 21} is the dictionary, then ages['Jeff'] returns 21. (Avoid naming a variable dict, as that hides the built-in dict type.)

In [None]:
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}

for key, value in x.items():
    print(key)
    print(value)

# This will output each key, followed by its corresponding value.

Christopher Brooks
brooksch@umich.edu
Bill Gates
billg@microsoft.com


## Tuples

You can unpack a tuple into separate variables in a single assignment. Remember, tuples themselves are immutable. See below:

In [None]:
# Say we have a tuple (the parentheses make this a tuple, not a list)
names = ('Tyrion', 'Lannister', 'Targaryen')

# Unpacking assigns each element to its own variable. The number of variables
# on the left must match the length of the tuple.
fore, sur, house = names

## More on String Manipulation

We can use the str.format() method to fill in string templates using other objects. Take a look:



In [None]:
sales_record = {
    'price': 3.24,
    'num_items': 4,
    'person': 'Chris'}

sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

# The format function works on a template string to replace sections of the string with values/strings from another Python object (sales_record here)
print(sales_statement.format(sales_record['person'],
                             sales_record['num_items'],
                             sales_record['price'],
                             sales_record['num_items'] * sales_record['price']))

Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96


## Reading and Writing CSV Files

Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

- mpg : miles per gallon
- class : car classification
- cty : city mpg
- cyl : # of cylinders
- displ : engine displacement in liters
- drv : f = front-wheel drive, r = rear wheel drive, 4 = 4wd
- fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
- hwy : highway mpg
- manufacturer : automobile manufacturer
- model : model of car
- trans : type of transmission
- year : model year

In [None]:
import csv

# Assumes datasets/mpg.csv exists relative to the notebook; otherwise open() raises FileNotFoundError.
with open('datasets/mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))

mpg[:3]  # The first three dictionaries in our list.

**csv.DictReader reads each row of our CSV file as a dictionary.** len(mpg) shows that our list is made up of 234 dictionaries. The column names from the header row become the keys, and each row supplies the corresponding values (all read in as strings).
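Since the real mpg.csv may not be available, the behaviour of csv.DictReader can be sketched with an in-memory file. The three rows and four columns below are a hypothetical stand-in for the dataset, not its actual contents.

```python
import csv
import io

# Hypothetical three-row sample standing in for datasets/mpg.csv.
sample = io.StringIO(
    "manufacturer,model,cty,hwy\n"
    "audi,a4,18,29\n"
    "audi,a4,21,29\n"
    "ford,mustang,15,23\n"
)

rows = list(csv.DictReader(sample))   # one dictionary per data row

print(len(rows))                      # number of data rows (header excluded)
print(list(rows[0].keys()))           # column names taken from the header row
print(rows[0]['model'])

# Every value is read as a string, so convert before doing arithmetic.
avg_cty = sum(float(r['cty']) for r in rows) / len(rows)
print(avg_cty)
```

The explicit float() conversion matters: summing the raw values would fail, because '18' and '21' are strings.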

## Advanced Python Objects

### Object Oriented Programming

Python Documentation will explain this much better, but OOP isn't really worth studying yet for Data Science.

### Maps - map()

The map function is one of the building blocks of functional programming in Python.

map(function, iterable, ...)

It applies the function to each item of the iterable. If several iterables are passed, the function must accept that many arguments and is applied to the items in parallel.

The map function returns a map object (a lazy iterator), not a list of results, so you won't see the values immediately. Iterate over the map object, or wrap it in list(), to get each value. Because values are computed one at a time as they are requested rather than all stored in memory, it is well suited to large datasets.
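A minimal sketch of map() with one and with two iterables. The input lists are made up for illustration.

```python
prices = [3, 10, 7]                         # hypothetical values

doubled = map(lambda p: p * 2, prices)      # nothing is computed yet
print(doubled)                              # prints a map object, not the values

doubled_list = list(doubled)                # consuming the iterator produces the values
print(doubled_list)

# With two iterables, the function receives one item from each per call.
cheapest = list(map(min, [1, 8, 3], [4, 2, 9]))
print(cheapest)
```

A map object can only be consumed once: after list(doubled) has run, iterating over doubled again yields nothing.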


## List Comprehensions

List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list.

Example:

Based on a list of fruits, you want a new list, containing only the fruits with the letter "a" in the name.

Without list comprehension you will have to write a for statement with a conditional test inside:

In [None]:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = []

for x in fruits:
  if "a" in x:
    newlist.append(x)

print(newlist)

With list comprehension you can do all that with only one line of code:

In [None]:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]

newlist = [x for x in fruits if "a" in x] # The first part is the value placed into the new list
# The part after 'for' is what you iterate over
# An optional 'if' condition filters the items
# You can nest loops with multiple fors, e.g. 'for x in fruits1 for y in fruits2'
# iterates over every item of fruits2 for each item of fruits1.

print(newlist)
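Nested loops in a comprehension read left to right, outermost loop first. A small sketch with two made-up lists:

```python
colours = ['red', 'green']      # hypothetical lists for illustration
fruits = ['apple', 'kiwi']

# Equivalent to: for c in colours: for f in fruits: append(c + ' ' + f)
pairs = [c + ' ' + f for c in colours for f in fruits]

print(pairs)
```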

## NumPy

In [None]:
# We can pass a list into the np.array function to create a numpy array. This can be a list of lists,
# creating a multi-dimensional array (functions as a matrix)

import numpy as np

b = np.array([[1,2,3],[4,5,6]])
b

# Check the dimensions

b.shape

(2, 3)

In [None]:
# Check data type inside the np.array

b.dtype

All elements of a NumPy array share a single dtype. Integers and floats are the most common; the array above holds integers.

In [None]:
# We can create placeholder arrays, in the event we know the shape of the array we want, but don't yet have
# values to fit into said array. We use the np.zeros() function here.

c = np.zeros((2,4))
print(c)

e = np.ones((2,4))
print(e)

f = np.random.rand(2,3)
print(f)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[[0.23249013 0.9378266  0.65831739]
 [0.42762632 0.45489757 0.84012978]]


### np.arange and np.linspace

In [None]:
g = np.arange(2, 50, 2) # Start, stop (exclusive), step
g

h = np.linspace(0, 15, 3) # Start, stop (inclusive by default), number of values. That's how linspace differs.

# So np.arange uses a step size, whereas linspace takes the number of values in the array as the 3rd param.

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
       36, 38, 40, 42, 44, 46, 48])
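The difference between the two functions is easiest to see on small inputs. In this sketch, note that np.arange leaves out the stop value while np.linspace includes it.

```python
import numpy as np

g_small = np.arange(2, 10, 2)      # step of 2, stop value 10 is excluded
h_small = np.linspace(0, 15, 3)    # 3 evenly spaced values, stop value 15 is included

print(g_small)    # [2 4 6 8]
print(h_small)    # [ 0.   7.5 15. ]
```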

## Array Operations

In [None]:
# The big benefit of numpy is that arithmetic operations are applied element-wise (vectorised),
# so we rarely need to write explicit loops.

# Let's see the effects here.
m1 = np.array([[1,2,3],[4,5,6]]) # Remember a matrix will be a list of lists. So [[],[]] structure.
m2 = np.array([[5,4,3],[2,1,0]])

m1 + m2 # Voila

# This could, for example, be used to convert fahrenheit to celsius easily. Just do (m - 32) * 5/9.

array([[6, 6, 6],
       [6, 6, 6]])
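As a sketch of the temperature example, the standard conversion subtracts 32 and multiplies by 5/9, and NumPy applies it to every element at once. The readings below are made up.

```python
import numpy as np

fahrenheit = np.array([32.0, 212.0, 98.6])   # hypothetical readings
celsius = (fahrenheit - 32) * 5 / 9          # applied element-wise, no loop needed

print(celsius)
```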

### The Boolean Array

Want to create an array that provides a boolean answer to a condition? Say we want to know which values of our matrix are less than 5. A comparison does that easily, and the output is a boolean matrix of the same shape.

In [None]:
m3 = np.array([[1,6,4],[4,3,2]])
m3 < 5

array([[ True, False,  True],
       [ True,  True,  True]])
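To get an actual count of how many values satisfy the condition, the boolean array can be summed, since True counts as 1 and False as 0. A minimal sketch using the same matrix:

```python
import numpy as np

m3 = np.array([[1, 6, 4], [4, 3, 2]])
mask = m3 < 5                           # boolean array, same shape as m3

count = mask.sum()                      # True counts as 1
per_row = mask.sum(axis=1)              # count within each row

print(count)                            # 5
print(per_row)                          # [2 3]
print(np.count_nonzero(mask))           # same total, computed another way
```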

### Matrix Multiplication

NumPy supports the matrix product, not just elementwise multiplication of matrices. Just use '@' instead of '*'.

You can of course use the .shape attribute to confirm that the number of columns of A matches the number of rows of B in A@B, for example.
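A small sketch contrasting the two operators, with a shape check before the matrix product. The matrices A and B are made up for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])       # shape (2, 3)
B = np.array([[1, 0], [0, 1], [1, 1]])     # shape (3, 2)

# Columns of A must equal rows of B for A @ B to be defined.
shapes_match = A.shape[1] == B.shape[0]

product = A @ B        # matrix product, shape (2, 2)
elementwise = A * A    # element-wise product, shape (2, 3)

print(shapes_match)
print(product)
print(elementwise)
```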

### What is 'upcasting'?

Upcasting is NumPy's behaviour when an operation combines arrays of different types: the resulting array takes the more general (less restrictive) of the two types. The best example is adding an array of integers to an array of floats. Even if some of the results are whole numbers, the resulting array will consist wholly of float values.
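Upcasting can be observed by checking dtype before and after an operation. A minimal sketch with made-up arrays (the exact integer dtype depends on the platform):

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.0, 3.5])

total = ints + floats      # the integers are upcast to float before adding

print(ints.dtype, floats.dtype, total.dtype)
print(total)               # every element is a float, even the whole numbers
```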

## Aggregation Functions


In [None]:
# We have many aggregation functions

array = np.array([1,2,3,4,5,6,7])
print(array.sum())
print(array.max())
print(array.min())
print(array.mean())

28
7
1
4.0


## Rest of notes on Week 1

You can use .reshape() to convert an array to different dimensions, like a list into a matrix.

PIL is Python Imaging Library, useful for outputting images.

- You can convert images into numpy arrays.

np.full() is basically np.ones or np.zeros but you choose the value that fills the array.

.astype() can be used to convert an array to another data type when the inferred dtype isn't the one you want. Converting floats to integers truncates the decimal part.
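The three helpers mentioned above, .reshape(), np.full() and .astype(), can be sketched together. All values here are made up for illustration.

```python
import numpy as np

flat = np.arange(6)                  # [0 1 2 3 4 5]
matrix = flat.reshape(2, 3)          # same 6 values laid out as 2 rows x 3 columns

sevens = np.full((2, 2), 7)          # like np.ones / np.zeros, but filled with 7

decimals = np.array([1.9, 2.1, -0.5])
as_int = decimals.astype(int)        # truncates toward zero, does not round

print(matrix)
print(sevens)
print(as_int)
```

The total number of elements must stay the same when reshaping: 6 values fit a (2, 3) or (3, 2) shape, but not (4, 2).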

## Indexing, Slicing and Iterating

### Array Indexing

Indexing works differently for multi-dimensional arrays, since you need to specify an index for each dimension. So array[1,1] gives us the element in the 2nd row, 2nd column.

There's another way of indexing, like a sort of 'zip' indexing, where you pass the list of row indices and the list of column indices you want to pull from. NumPy pairs the lists up element-wise, picking the element at [rows[i], cols[i]] for each i. See below:


In [None]:
a = np.array([[1,2],[3,4],[5,6]])

# Try the normal method of indexing, say we want 1, 4 and 6.

print(np.array([a[0,0], a[1,1], a[2,1]]))

# Let's try it the other way, we want [0,1,2] as the list of rows, and [0,1,1] as the columns.
print(a[[0,1,2], [0,1,1]]) # The result is already an array, so no np.array() wrapper is needed

# Voila, same result.

[1 4 6]
[1 4 6]


### Boolean Indexing/Masking

Boolean indexing (masking) is one of the most useful tools NumPy offers, and it underpins much of what you do in the pandas toolkit.

In [None]:
# Say we want to return all values from the np.array above (a), that are greater than 5.

print(a[a>5])

# What's happening here is that the output of a>5 (the boolean array) is being masked onto
# the original array a to only collect the values in the position where a>5 matrix is TRUE.

# See:
print(a)
print(a>5)

[6]
[[1 2]
 [3 4]
 [5 6]]
[[False False]
 [False False]
 [False  True]]


### Slicing

Slicing works largely as before. For multi-dimensional arrays you give one slice per dimension, separated by commas. Let's see multi-dimensional slicing below:

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a[:2, 1:3] # The first two rows, but only from the 2nd, 3rd columns (remember index 3 not included)

array([[2, 3],
       [6, 7]])

#### Passing by Reference

Recall the difference between passing by reference and passing by value. Anything passed by reference refers to the same underlying data, so edits made through it show up in the original.

A slice of a NumPy array behaves this way: it is a view onto the original data, not a copy. Modifying the sub-array will modify the original array. Call .copy() on the slice if you need an independent array.

This can get confusing when the slice leaves out a row/column, because the indices shift. If we create a sub-array that skips the first row (index 0) of the original array and then set subarray[0,0] = 2, this edits the original array at array[1,0].
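The index shift is easier to follow in code. This sketch builds a made-up 3x4 array, takes a slice that skips the first row, and shows that writing to the slice changes the original, while a .copy() does not.

```python
import numpy as np

original = np.arange(12).reshape(3, 4)

sub = original[1:, :]              # a view that skips row 0 of the original
sub[0, 0] = 99                     # sub[0, 0] is the same memory as original[1, 0]

independent = original[1:, :].copy()
independent[0, 0] = -1             # only the copy changes

print(original)
print(np.shares_memory(original, sub))           # True: sub is a view
print(np.shares_memory(original, independent))   # False: a separate array
```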

### Trying NumPy with Datasets


In [None]:
# To load a dataset in NumPy, we can use the genfromtxt() function. We can specify the data file name, the delimiter
# (optional, but often needed), and the number of header rows to skip, hence skip_header=1 here.

# The genfromtxt() function has an optional parameter called dtype for specifying the data type of each column.
# If it is not specified, every column is read as float, and anything that can't be parsed
# as a number becomes nan.

wines = np.genfromtxt("datasets/winequality-red.csv", delimiter=";", skip_header=1)
wines

## Manipulating Text with Regular Expression (RegEx)

Pattern matching in strings using regular expressions (regexes) is very useful as a Data Scientist: you can analyse text data for patterns, spot complex patterns in source data, or clean text data by splitting strings on a pattern.



In [None]:
# First import the re module, the Python standard regex library.
import re

# Here's an example
text = 'This is a string, how great.'

# Search for 'great' in the string
if re.search("great", text): # 1st param is the pattern, 2nd the data to search
  print("Something was great!")
else:
  print("Aww, it wasn't so great.")

# re.search returns a match object (truthy) if the pattern is found and None otherwise,
# so it works like a boolean in an if statement. Very useful.

Something was great!


In [None]:
# We can do more than test whether a pattern is present: we can also segment strings. Breaking text
# into chunks like this is called tokenizing - this is a core activity in Natural Language Processing (NLP)

# The findall() and split() functions will parse the string for us and return chunks. See example:
text = 'Amy works diligently. Amy gets good grades. Our student Amy is successful.'

re.split('Amy', text)

# You'll notice split returned an empty string, followed by statements about Amy. We can use
# findall() to count how many times we talked about Amy.
print(re.findall('A[mo]y', text)) # [mo] matches 'm' or 'o', so this finds 'Amy' or 'Aoy'
len(re.findall('Amy', text))

['Amy', 'Amy', 'Amy']


3

In [None]:
# Recap:

# re.search(pattern, string)   Looks for the first match of a pattern in a string; returns a match object or None.

# re.split(pattern, string)    Splits the string wherever the pattern matches and returns the pieces in a list.

# re.findall(pattern, string)  Returns a list of all non-overlapping matches. Lots of different wildcards/options
# to narrow the search using logic.

In [None]:
# Extra re.findall() methods:
import re # Needed if this cell runs in a fresh session (avoids a NameError)

grades = 'ACAAAABCBCBAA'

print(re.findall("[AB]", grades)) # [] matches any one character in the set (A or B), called the set operator.

print(re.findall("^A", grades)) # Matches an A only at the very start of the string, called the caret operator.

print(re.findall("[^A]", grades)) # Inside a set, the caret negates: any character that is not A. Essentially a complement.

print(re.findall("[A][B-E]", grades)) # An A followed by one character in the range B to E (so B, C, D or E).

print(re.findall("AB|AC", grades)) # Pipe operator |: matches AB or AC. Same matches as above here, since grades has no AD or AE.

### Regex - Quantifiers

Quantifiers specify how many times a pattern must repeat in order to match. The most basic quantifier is expressed as e{min,max}, where e is the expression or character we are matching, min is the minimum number of times it must be matched, and max is the maximum number of times it can be matched.

'\w' matches any word character: a letter, a digit or an underscore (w for 'word'), so you don't have to write something like '[a-zA-Z0-9_]'.

To shorten the syntax there are shorthand quantifiers: * means {0,} (match any number of times, including zero), + means {1,} (match one or more times, so it must match at least once), and ? means {0,1} (match zero or one time, i.e. the item is optional).

There are also shorthand ways to signal word character, digit, whitespace etc. The uppercase form negates the lowercase form:

- \w any word character, \W any non-word character
- \s whitespace character, \S any non-whitespace character
- \d digit character, \D non-digit character

Others include \b, which matches a word boundary.
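The quantifiers and shorthand classes can be tried out with re.findall(). This is a small sketch; the sample strings are made up, and raw strings (r'...') are used so the backslashes reach the regex engine unchanged.

```python
import re

text = 'Order 66 shipped on 2023-06-17 to A. Smith'   # hypothetical sample text

digit_runs = re.findall(r'\d+', text)          # + : one or more digits
four_digits = re.findall(r'\d{4}', text)       # {4} : exactly four digits
optional_u = re.findall(r'colou?r', 'color colour')   # ? : the 'u' is optional
star = re.findall(r'ab*', 'a ab abbb')         # * : zero or more 'b' after 'a'
words = re.findall(r'\w+', 'a_b c-d')          # \w includes the underscore
spaces = re.findall(r'\s', 'a b\tc')           # \s matches space and tab

print(digit_runs)     # ['66', '2023', '06', '17']
print(four_digits)    # ['2023']
print(optional_u)     # ['color', 'colour']
print(star)           # ['a', 'ab', 'abbb']
print(words)          # ['a_b', 'c', 'd']
print(spaces)
```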



#### TESTING STUFF BELOW:

In [None]:
# Testing whether plain assignment copies the array or refers to the same one

a = np.array([[0,1,2],[1,2,3]])
b = a # No copy is made: b is just another name for the same array

b[0,0] = 5
a

array([[5, 1, 2],
       [1, 2, 3]])

In [None]:
import re
s = 'ACAABAACAAAB'
len(re.findall('A{1,2}', s)) # Looking for instances of one A or two A's consecutively.



5