# Lab 0: Jupyter and Basic Python

In this lab, you will load a text and perform a basic analysis of it using Python in an interactive Jupyter notebook.

Follow the instructions and run the code cells. Make sure you understand what happens at every stage.

<b>Helpful Links:</b>

  https://docs.scipy.org/doc/numpy/reference/
  
  https://docs.scipy.org/doc/scipy/reference/
  
  https://pandas.pydata.org/pandas-docs/stable/reference/index.html
  
  http://www.utc.fr/~jlaforet/Suppl/python-cheatsheets.pdf


## Working with Google Colab

### Code Cells

Deleting, creating, editing, running and interrupting code cells.
Edit mode and command mode.

In [None]:
print("Hello World") #print to standard output

Hello World


In [None]:
"Hello World" # output without printing

'Hello World'

In [None]:
my_list = [1, 2, 3, 4, 5]
print(my_list)

[1, 2, 3, 4, 5]


In [None]:
my_list

[1, 2, 3, 4, 5]

In [None]:
"Hello World"
my_list # only the value of the last line is displayed in a cell

[1, 2, 3, 4, 5]

In [None]:
# with print, we can show multiple outputs in a cell
print("Hello World")
print(my_list)

Hello World
[1, 2, 3, 4, 5]


In [None]:
"Hello World"
my_list
my_second_list = [6, 7, 8, 9, 10] # the last line is an assignment, which has no value, so nothing is displayed

In [None]:
my_second_list

[6, 7, 8, 9, 10]

In [None]:
my_second_list; # suppressing output with a trailing semicolon

### Markdown Cells

Hello, I am just annotating my notebook.

Markdown is a popular lightweight markup language. It also accepts inline HTML.

"\*" was used to italicize this: *I love Data Science*

"\*\*" was used to bold this: **I love Data Science**

***Play around with Colab's markdown tools*** 😀.



## Getting the data
The following code snippet downloads a file and saves it locally. We will download the poem "HOWL" by Allen Ginsberg and save it to a file named "howl.txt".

In [None]:
DOWNLOAD_URL = "http://www.everyday-beat.org/ginsberg/poems/howl.txt"
SAVE_URL = "howl.txt"



In [None]:
from urllib.request import urlopen

downloadedText = urlopen(DOWNLOAD_URL).read().decode('utf-8')
with open(SAVE_URL, "w") as file:
    file.write(downloadedText)

-----------
## Loading the data
* First we open the file using *open()*
* **Make sure the file path is correct**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
with open("howl.txt", "r") as poemFile:
    poemText = poemFile.read()


----------
## Review the data
Let's print the first 500 characters of the text:

In [None]:
print(poemText[0:500])


                          HOWL

                    For Carl Solomon 

                           I 

       I saw the best minds of my generation destroyed by 
              madness, starving hysterical naked, 
       dragging themselves through the negro streets at dawn 
              looking for an angry fix, 
       angelheaded hipsters burning for the ancient heavenly 
              connection to the starry dynamo in the machin- 
              ery of night, 
       who poverty and tatters 


----------
Now, let's print the beginning of the second section. It starts at character 16692:

In [None]:
print(poemText[16692:17500])

                           II 

       What sphinx of cement and aluminum bashed open 
              their skulls and ate up their brains and imagi- 
              nation? 
       Moloch! Solitude! Filth! Ugliness! Ashcans and unob 
              tainable dollars! Children screaming under the 
              stairways! Boys sobbing in armies! Old men 
              weeping in the parks! 
       Moloch! Moloch! Nightmare of Moloch! Moloch the 
              loveless! Mental Moloch! Moloch the heavy 
              judger of men! 
       Moloch the incomprehensible prison! Moloch the 
              crossbone soulless jailhouse and Congress of 
              sorrows! Moloch whose buildings are judgment! 
              Moloch the vast stone of war! Moloch the stun- 
              ned governments! 
     


---------
## Counting the number of lines in the poem
We start by splitting the text into a list of lines using *splitlines()*:

In [None]:
poemLines = poemText.splitlines()

---------
Now we print the first 10 lines:

In [None]:
poemLines[0:10]

['',
 '                          HOWL',
 '',
 '                    For Carl Solomon ',
 '',
 '                           I ',
 '',
 '       I saw the best minds of my generation destroyed by ',
 '              madness, starving hysterical naked, ',
 '       dragging themselves through the negro streets at dawn ']

----------
Finally, we count the lines using the *len()* function:

In [None]:
print("The number of lines in the text is:", len(poemLines))

The number of lines in the text is: 445


---------
## Counting the number of words in the poem
We start by splitting the text into a list of words using *split()*:

In [None]:
poemWords = poemText.split()

---------
Now we print the first 10 words:

In [None]:
poemWords[:10]

['HOWL', 'For', 'Carl', 'Solomon', 'I', 'I', 'saw', 'the', 'best', 'minds']

----------
Finally, we count the words using *len()* on the list of words:

In [None]:
print("The number of words in the text is:", len(poemWords))

The number of words in the text is: 2957


## Counting the number of occurrences of a word

In [None]:
# first, initialize a word counter to 0
wordsCounter = 0

# for every word in the list poemWords
for word in poemWords:
    # strip leading and trailing punctuation, and convert to lower case
    if word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower() == "who":
        wordsCounter += 1

In [None]:
print("The number of occurrences of 'Who' is:", wordsCounter)

The number of occurrences of 'Who' is: 68


## Counting the number of lines starting with a word

In [28]:
linesCounter = 0
for line in poemLines:
    if line.lower().strip().startswith("who"):
        linesCounter += 1

In [None]:
print("The number of lines that start with the word 'Who' is:", linesCounter)

The number of lines that start with the word 'Who' is: 64
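
Note that `startswith("who")` also matches lines that begin with a longer word such as "whose" or "whole". If an exact word match is wanted, one option is to compare only the first token of each line. The sketch below uses a hypothetical helper `starts_with_word` and made-up sample lines. It is not part of the original lab and does not reproduce the count above.

```python
# Hypothetical sample lines, written for illustration only
sample_lines = [
    "       who poverty and tatters",
    "       whose buildings are judgment",
    "       Who, me?",
    "       the whole world",
    "",
]

def starts_with_word(line, word):
    """Return True if the first whitespace-separated token of `line`,
    stripped of surrounding punctuation, equals `word` (case-insensitive)."""
    tokens = line.split()
    if not tokens:
        return False  # blank line
    first = tokens[0].strip(""" ,.*()[]!@#$%^&{}?'`"-""").lower()
    return first == word

# True counts as 1 and False as 0, so sum() counts the matching lines
count = sum(starts_with_word(line, "who") for line in sample_lines)
print(count)  # 2
```

Only the first and third sample lines match: "whose" and "the" are different first words, and the blank line has no tokens at all.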


## Counting the number of unique words

In [None]:
# strip leading and trailing punctuation and convert to lower case
cleanWords = [word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower()
               for word in poemWords]

In [None]:
print("number of different words: ", len(set(cleanWords)))

number of different words:  1318


## Using *Counter* collection
We can use *Counter* to count the frequency of every item in a list.

In [None]:
from collections import Counter
wordsCounter = Counter(cleanWords)

---------
What is the number of occurrences of the word 'who'?

In [None]:
wordsCounter["who"]

68

---------
What are the 10 most common words?

In [None]:
wordsCounter.most_common(10)

[('the', 220),
 ('of', 139),
 ('and', 120),
 ('in', 111),
 ('who', 68),
 ('to', 58),
 ('with', 48),
 ('moloch', 39),
 ('a', 38),
 ('you', 32)]
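
A *Counter* behaves like a dictionary that maps each item to its count, with one convenient difference: looking up an item that never appeared returns 0 instead of raising an error. A minimal sketch on a small made-up list (the name `toy_counter` is for illustration only):

```python
from collections import Counter

# Small hypothetical word list, not taken from the poem
toy_counter = Counter(["a", "b", "a", "c", "a", "b"])

print(toy_counter["a"])            # 3
print(toy_counter["z"])            # 0, missing items count as zero
print(toy_counter.most_common(2))  # [('a', 3), ('b', 2)]
```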

-------
## Finding a phrase in the text

Let's find the first occurrence of the phrase "Canada" in the text.

In [None]:
poemText.find("Canada")

1845

-------
Let's examine the text around this position:

In [None]:
print(poemText[1800:1900])

 mind leaping toward poles of 
              Canada & Paterson, illuminating all the mo- 
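
*find()* returns the index of the first character of the match, or -1 when the phrase does not occur, so it is worth checking the result before slicing. A small sketch on a hypothetical one-line string (not the downloaded file):

```python
# Hypothetical sample string for illustration
sample_text = "mind leaping toward poles of Canada & Paterson"

position = sample_text.find("Canada")
missing = sample_text.find("Toronto")

if position != -1:
    # Slice exactly the matched phrase back out of the text
    print(sample_text[position:position + len("Canada")])  # Canada

print(missing)  # -1, the phrase is not in the text
```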
          


## List



A list is a collection which is ***ordered*** and ***mutable (changeable)***.

In Python, lists are written with square brackets [].

In [None]:
greetingsList = ["Hello","Hi","How are you?"]
#Change value of a specific index
greetingsList[1] = "What's up?"
#When printing a range, the first value is included, the second value is excluded
print(greetingsList[0:2])

['Hello', "What's up?"]


In [None]:
#Append item to end of list
greetingsList.append("Hey")
#Insert item at specified index
greetingsList.insert(2, "Yo")
print(greetingsList)

['Hello', "What's up?", 'Yo', 'How are you?', 'Hey']


You can sort a list by calling sort()

In [None]:
greetingsList.sort()
print(greetingsList)

['Hello', 'Hey', 'How are you?', "What's up?", 'Yo']


Sort a list with numerical values

In [None]:
nums = [3,2,4,5,1]
nums.sort()
nums

[1, 2, 3, 4, 5]

Removing items from a list

In [None]:
#Remove item from list by index
del greetingsList[2]

print(greetingsList)
print(greetingsList[2])

['Hello', 'Hey', "What's up?", 'Yo']
What's up?


In [None]:
#Remove item from list by value
greetingsList.remove("Hello")

print(greetingsList)

['Hey', "What's up?", 'Yo']


Merging 2 lists using list1+list2

In [None]:
#Join two lists
foreignGreetingsList = ["Bonjour","Ciao"]
moreGreetings = greetingsList + foreignGreetingsList
print(moreGreetings)

['Hey', "What's up?", 'Yo', 'Bonjour', 'Ciao']


Alternatively, call list1.extend(list2)

In [None]:
greetingsList.extend(foreignGreetingsList)
print(greetingsList)

['Hey', "What's up?", 'Yo', 'Bonjour', 'Ciao']


Make a copy of the list

In [None]:
#Make a copy of a list - wrong way
moreGreetingsCopy = moreGreetings
moreGreetingsCopy[0] = "Hola"
print(moreGreetingsCopy[0] +' '+ moreGreetings[0])

Hola Hola


In [None]:
#Make a copy of a list - right way
moreGreetings = ['Hey', "What's up?", 'Yo', 'Bonjour', 'Ciao']
moreGreetingsCopy = moreGreetings.copy()
moreGreetingsCopy[0] = "Hola"
print(moreGreetingsCopy[0] +' '+ moreGreetings[0])

Hola Hey
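
The difference between the two cells above is that plain assignment only creates a second name for the same list, while *copy()* builds a new list object. One way to check this is the `is` operator, which tests object identity. A minimal sketch with made-up variable names:

```python
original = ["Hey", "Yo"]

alias = original           # second name for the same list object
shallow = original.copy()  # new list with the same items

print(alias is original)    # True
print(shallow is original)  # False
print(shallow == original)  # True, equal contents

alias.append("Ciao")        # also visible through `original`
print(original)             # ['Hey', 'Yo', 'Ciao']
print(shallow)              # ['Hey', 'Yo']
```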


### List comprehensions

In [None]:
# often easier to read than map; evaluated eagerly and returns a list
[num**2 for num in range(1,6)]

[1, 4, 9, 16, 25]

In [None]:
[num**2 for num in range(1,6) if num % 2 != 0]

[1, 9, 25]

In [None]:
[num**2 if (num % 2 != 0) else 0 for num in range(1,6)]

[1, 0, 9, 0, 25]
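
The last two comprehensions look similar but do different things: an `if` placed after the `for` filters items out, while an `if ... else` placed before the `for` keeps every item and only chooses its value. Written as ordinary loops (variable names are for illustration), they are roughly equivalent to:

```python
numbers = range(1, 6)

# Trailing "if": a filter, so even numbers are dropped entirely
filtered = []
for num in numbers:
    if num % 2 != 0:
        filtered.append(num**2)

# "a if cond else b" before "for": every item is kept, but its value changes
transformed = []
for num in numbers:
    transformed.append(num**2 if num % 2 != 0 else 0)

print(filtered)     # [1, 9, 25]
print(transformed)  # [1, 0, 9, 0, 25]
```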

***Exercise:***

Using a list comprehension, write code that takes a list of names and returns a list of "Hello *name*" for each *name* that has more than 6 characters.


In [31]:
names = ["name1", "longname2", "verylongname3", "name4", "longlonglong5"]
#Desired output : ['Hello longname2', 'Hello verylongname3', 'Hello longlonglong5']

#Write your code here
["Hello " + name for name in names if len(name) > 6 ]

['Hello longname2', 'Hello verylongname3', 'Hello longlonglong5']

## Tuple

A tuple is a collection which is ***ordered*** and ***immutable (unchangeable)***. You can't append elements to a tuple or remove them from it as you can with a list.

In Python tuples are written with round brackets ().


In [None]:
TA_tuple = ('Jolomi','Joe', 'Wenhao', 'Esmat', 'George')
print(TA_tuple)

('Jolomi', 'Joe', 'Wenhao', 'Esmat', 'George')


***Ordered***: you can access tuple elements by indexing.

In [None]:
TA_tuple[0]

'Jolomi'

In [None]:
#Negative indexing, last item in tuple has index -1
TA_tuple[-1]

'George'

***Immutable***: You cannot modify an existing element in a tuple or append to it.

In [None]:
# This will raise a TypeError
# TA_tuple[0] = 'Emma'
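
To see the error without stopping the notebook, the assignment can be wrapped in `try`/`except`. A minimal sketch using a made-up tuple:

```python
sample_tuple = ('a', 'b', 'c')

try:
    sample_tuple[0] = 'z'  # tuples do not support item assignment
except TypeError as err:
    message = str(err)

print(message)       # 'tuple' object does not support item assignment
print(sample_tuple)  # ('a', 'b', 'c'), unchanged
```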

Iterate through a tuple

In [None]:
for TA in TA_tuple:
  print(TA)

Jolomi
Joe
Wenhao
Esmat
George


In [None]:
print(len(TA_tuple))

5


You can cast a tuple to a list by using list(tuple)

In [None]:
TA_list = list(TA_tuple)
print(TA_list)

['Jolomi', 'Joe', 'Wenhao', 'Esmat', 'George']


## Set

A set is a collection which is ***unordered*** and ***mutable (changeable)***. It ***cannot*** have duplicate values.

In Python sets are written with curly brackets {}.

In [None]:
petSet = {"Dog","Cat","Dog","Fish"}
for x in petSet:
  print(x)

Cat
Dog
Fish


***Unordered***: You cannot access elements inside a set by indexing.

In [None]:
#this throws an error
# petSet[0]

***Mutable***: You can add or remove items in a set after creation

In [None]:
# Once a set is created, you cannot change an item in place, but you can add new items and remove existing ones.
petSet.add("Bird")
petSet.remove("Fish")
print(petSet)

{'Cat', 'Dog', 'Bird'}


In [None]:
#Check if set contains a value
x = "Hamster"
if x in petSet:
  print(x + " in petSet")
else:
  print(x + " not in petSet")

Hamster not in petSet


Clear the content in a set (to get an empty set)

In [None]:
#Clear a set
petSet.clear()
print(petSet)

set()


Delete the set (this removes the name `petSet`; the memory can be reclaimed once nothing else refers to the set)

In [None]:
#delete a set
del petSet
#print(petSet)

You can cast a set to a list

In [None]:
petList = list({'Fish','Cat','Dog'})
print(petList)

['Cat', 'Dog', 'Fish']
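
Because a set is unordered, the order of the list produced by `list(...)` is not guaranteed and may differ between runs. If a predictable order matters, `sorted()` returns a sorted list directly. A small sketch (the variable name is for illustration):

```python
# sorted() accepts any iterable, including a set, and always returns a list
sortedPets = sorted({'Fish', 'Cat', 'Dog'})
print(sortedPets)  # ['Cat', 'Dog', 'Fish']
```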


## Dictionary

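A dictionary is a collection of **key-value pairs** which is ***mutable (changeable)*** and ***indexed*** by key. Since Python 3.7, dictionaries preserve insertion order, but they have no positional index and the order does not affect equality.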

Dictionaries can't have duplicate keys.

In Python dictionaries are written with curly brackets {key:value}.

In [None]:
#initialize blank dictionary
someDict = {}
#initialize dictionary with two key-value pairs
someDict = {'university':'UofT', 'course':'MIE223'}
print(someDict)

{'university': 'UofT', 'course': 'MIE223'}


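***Order does not affect equality***: two dictionaries with the same key-value pairs are equal, whatever order the pairs were inserted in.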

In [None]:
someDict1 = {'course':'MIE223', 'university':'UofT'}
print(someDict1 == someDict)

True
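
In Python 3.7 and later, a dictionary remembers the order in which keys were inserted, yet that order plays no role in comparisons. The sketch below, with made-up variable names, shows both facts side by side:

```python
first = {'university': 'UofT', 'course': 'MIE223'}
second = {'course': 'MIE223', 'university': 'UofT'}

print(first == second)  # True, same pairs
print(list(first))      # ['university', 'course']
print(list(second))     # ['course', 'university']
```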


***Indexed***: you can access a value by indexing with its key

In [None]:
print(someDict['university'])

UofT


***Mutable***: You can add/remove key-value pairs after creation. You can also change values for keys.

In [None]:
someDict['year']='2023'
print(someDict)

{'university': 'UofT', 'course': 'MIE223', 'year': '2023'}


In [None]:
someDict['year']='2024'
print(someDict)

{'university': 'UofT', 'course': 'MIE223', 'year': '2024'}


In [None]:
del someDict['year']
print(someDict)

{'university': 'UofT', 'course': 'MIE223'}


Iterate through a dictionary

In [None]:
#Iterate through the dictionary keys and print each one
for x in someDict.keys():
  print(x)

university
course


In [None]:
#Iterate through the dictionary values and print each one
for x in someDict.values():
  print(x)

UofT
MIE223


In [None]:
#Iterate through the dictionary key-value pairs
for key, value in someDict.items():
  print('{}:{}'.format(key,value))

university:UofT
course:MIE223


In [None]:
#Check if key exists in dictionary, using keyword 'in'
search_keys = ['course', 'professor']
for k in search_keys:
  if k in someDict:
    print("""key "{}" found with value {}""".format(k,someDict[k]))
  else:
    print("""key "{}" not found""".format(k))

key "course" found with value MIE223
key "professor" not found
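
When a missing key should simply fall back to a default, the dictionary method *get()* is a common alternative to the `in` check. A minimal sketch with a made-up dictionary:

```python
course_info = {'university': 'UofT', 'course': 'MIE223'}

print(course_info.get('course'))                # MIE223
print(course_info.get('professor'))             # None, no KeyError is raised
print(course_info.get('professor', 'unknown'))  # unknown
```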


In [None]:
# The value of a key can be a list
TA_list = ['Jolomi','Joe', 'Wenhao', 'Esmat', 'George']
someDict['TAs'] = TA_list
print(someDict)

{'university': 'UofT', 'course': 'MIE223', 'TAs': ['Jolomi', 'Joe', 'Wenhao', 'Esmat', 'George']}


### Dictionary comprehension

Create a new dictionary containing name and age for people under 30 years old.

In [None]:
#Dictionary comprehension
#Keep only the key-value pairs whose value meets a condition, e.g., people whose age is under 30
peopleAge = {'Sam':25,'Luke':35,'Judy':50,'Paul':18}
peopleUnder30 = {key:value for (key,value) in peopleAge.items() if value < 30}
for (name,age) in peopleUnder30.items():
  print('Name: {:<6}, Age:{:<5} '.format(name,age))

Name: Sam   , Age:25    
Name: Paul  , Age:18    


In [None]:
print(peopleUnder30)

{'Sam': 25, 'Paul': 18}


## Collections Recap
There are four collection data types in the Python programming language:

* List is a collection which is ***ordered*** and ***mutable (changeable)***. Allows duplicate members.

* Tuple is a collection which is ***ordered*** and ***immutable (unchangeable)***. Allows duplicate members.

* Set is a collection which is ***unordered***, ***unindexed*** and ***mutable***. No duplicate members.

* Dictionary is a collection of key-value pairs which is ***mutable*** and ***indexed*** (values are indexed by keys). Insertion order is preserved since Python 3.7, but items cannot be accessed by position. No duplicate keys, but allows duplicate values.



## Range

The function **``range(x,y)``** returns the integers from x up to, but not including, y (that is, x to y-1).

In [None]:
for num in range(5):
    print(num)

0
1
2
3
4


However, range is lazily evaluated: it produces its numbers on demand rather than building a list up front.

In [None]:
range(1,5)

range(1, 5)

In [None]:
print(range(1,5))

range(1, 5)


In [None]:
list(range(1,5))

[1, 2, 3, 4]

In [None]:
list(range(1,11,2))

[1, 3, 5, 7, 9]

***Exercise:***

1. The function range has a third argument z: ```range(x,y,z)```. Find out what it does.


Answer: range starts at x and advances by z at each step, stopping before it reaches y.
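
A few extra cases make the step argument and the lazy evaluation concrete. This is an illustrative sketch; the specific numbers are arbitrary:

```python
print(list(range(0, 10, 3)))   # [0, 3, 6, 9]
print(list(range(10, 0, -3)))  # [10, 7, 4, 1], a negative step counts down
print(list(range(5, 1)))       # [], start is already past stop

# A range stores only start, stop and step, so a huge range costs no memory
big = range(0, 10**9)
print(len(big))                # 1000000000
print(big[-1])                 # 999999999
```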



## Map

map applies a function to every item of an iterable, such as a list:

```map(func, input_list)```

In [None]:
items = [1, 2, 3, 4, 5]
squaredItems = []
for item in items:
    squaredItems.append(item**2)

print(squaredItems)

[1, 4, 9, 16, 25]


In [None]:
def square(num):
    return num**2

square(2)

4

In [None]:
items = [1, 2, 3, 4, 5]
squaredItems_bymap = list(map(square, items))
print(squaredItems_bymap)

[1, 4, 9, 16, 25]


In [None]:
squaredItems = list(map(square, range(1,6)))
print(squaredItems)

[1, 4, 9, 16, 25]


**map is lazily evaluated**

In [None]:
map(square, range(1,6))

<map at 0x7cb4d836a9b0>

In [None]:
print(map(square, range(1,6)))

<map object at 0x7cb4d836bcd0>


In [None]:
list(map(square, range(1,6)))

[1, 4, 9, 16, 25]
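
One consequence of lazy evaluation is that a map object is an iterator and can be consumed only once. A second pass over the same object yields nothing. A minimal sketch with made-up names:

```python
squares = map(lambda n: n**2, range(1, 4))

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # nothing is left

print(first_pass)   # [1, 4, 9]
print(second_pass)  # []
```

If the results are needed more than once, store the list rather than the map object.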

***Exercise***:

Write a function that takes a list of strings, and returns a list of the lengths of the strings.

For example, listLengths(["dog", "cat", "donkey", "elephant"]) should return [3, 3, 6, 8]

In [33]:
def listLengths(strings):
    # Build a fresh list on every call so results don't accumulate across calls
    lengths = []
    for item in strings:
        lengths.append(len(item))
    return lengths

listLengths(["dog", "cat", "donkey", "elephant"])


[3, 3, 6, 8]

## Lambda functions

A lambda is a short way of writing a simple function: it consists of a single expression, whose value is returned.

In [None]:
def longMultiply(num):
    return num * 2

shortMultiply = lambda num: num * 2

In [None]:
longMultiply(5)

10

In [None]:
shortMultiply(5)

10

In [None]:
longMultiply(5) == shortMultiply(5)

True

An easier way of writing maps:

In [None]:
for i in map(lambda x: x+1, range(1,5)):
  print (i)

2
3
4
5


In [None]:
list(map(lambda x: x+1, range(1,5)))

[2, 3, 4, 5]

In [None]:
list(map(lambda x: x+1, range(1,5))) == list(range(2,6))

True

Writing the list length example with a lambda function:

In [None]:
list(map(lambda x: len(x), ["dog", "cat", "donkey", "elephant"]))

[3, 3, 6, 8]

***Exercise:***

Write the square example (above) using lambda functions

In [34]:
# Write your code here
list(map(lambda x: x**2, range(1,6)))

[1, 4, 9, 16, 25]

## Filter

In [None]:
items = [1, 2, 3, 4, 5]
oddItems = []
for item in items:
    if item % 2 != 0:
        oddItems.append(item)

print(oddItems)

[1, 3, 5]


In [None]:
def isOdd(num):
    if num % 2 != 0:
        return True
    else:
        return False

In [None]:
items = [1, 2, 3, 4, 5]
# if isOdd returns True, the item is kept in the result
oddItems = list(filter(isOdd, items))
print(oddItems)

[1, 3, 5]


Note that ``num % 2 != 0`` is a boolean expression (evaluates to True or False)

In [None]:
num = 4
num % 2 != 0

False

In [None]:
def isOdd(num):
    return num % 2 != 0

items = [1, 2, 3, 4, 5]
oddItems = list(filter(isOdd, items))
print(oddItems)

[1, 3, 5]


With lambda functions

In [None]:
oddItems = list(filter(lambda num: num % 2 != 0, range(1, 6)))
print(oddItems)

[1, 3, 5]


***Exercise:***

Write code that takes the numbers in ```range(1,6)``` and outputs a list of their squares that are also odd. Use only map, filter, and lambda functions.

In [37]:
# your code here
list(filter(lambda x: x%2 != 0, map(lambda x: x**2, range(1,6))))


[1, 9, 25]
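
For comparison, the same result can be written as a single list comprehension, which many readers find easier to follow than nested map and filter calls. This is an alternative sketch, not the form the exercise asks for:

```python
# Square each number, then keep the square only if it is odd
odd_squares = [n**2 for n in range(1, 6) if n**2 % 2 != 0]
print(odd_squares)  # [1, 9, 25]

# Same result as the map/filter version
via_map_filter = list(filter(lambda x: x % 2 != 0, map(lambda x: x**2, range(1, 6))))
print(odd_squares == via_map_filter)  # True
```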

## String Formatting

In [None]:
year = 2024

In [None]:
"The year is " + str(year) + " right now"

'The year is 2024 right now'

In [None]:
"The year is %d right now" % (year)

'The year is 2024 right now'

In [None]:
"The year is {} right now".format(year)

'The year is 2024 right now'

In [None]:
years = [2022, 2023, 2024]

In [None]:
["The year is {}".format(y) for y in years]

['The year is 2022', 'The year is 2023', 'The year is 2024']

In [None]:
" | ".join(["The year is {}".format(y) for y in years])

'The year is 2022 | The year is 2023 | The year is 2024'

In [None]:
print("\n".join(["The year is {}".format(y) for y in years]))

The year is 2022
The year is 2023
The year is 2024


For more examples check: https://pyformat.info/
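
Python 3.6 and later also support formatted string literals (f-strings), which place the expression directly inside the braces. A short sketch of the same examples in that style (variable names other than `year` are for illustration):

```python
year = 2024

sentence = f"The year is {year} right now"
next_year = f"Next year is {year + 1}"  # any expression is allowed in the braces
pi_text = f"{3.14159:.2f}"              # format specs work as in str.format

print(sentence)   # The year is 2024 right now
print(next_year)  # Next year is 2025
print(pi_text)    # 3.14
```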

## NumPy library
Python library to support large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Core functionality is offered through the ndarray (n dimensional array).

In [None]:
import numpy as np
array = np.arange(8).reshape(2,4)
array

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [None]:
# Access the shape, number of dimensions, data type, and number of elements in our array
# Tip: use "print()" when you want a cell to output more than one thing, or you want to append text to your output, otherwise the cell will output the last object you call, as in the cell above
print ("Shape:", array.shape)
print ("Dimensions:", array.ndim)
print ("Data type:" , array.dtype.name)
print ("Number of elements:", array.size)

Shape: (2, 4)
Dimensions: 2
Data type: int64
Number of elements: 8


In [None]:
# Use Python list containing a set of numbers to create a NumPy array
mylist = [0, 1, 1, 2, 3, 5, 8, 13, 21]
myarray = np.array(mylist)
myarray

array([ 0,  1,  1,  2,  3,  5,  8, 13, 21])

In [None]:
# Can do it for nested lists as well, creating multidimensional NumPy arrays:
my2dlist = [[1,2,3],[4,5,6]]
my2darray = np.array(my2dlist)
my2darray

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
#Can also index and slice NumPy arrays like we would do with a Python list or another container object
array = np.arange(10)
print ("Originally: ", array)
print ("First four elements: ", array[:4])
print ("After the first four elements: ", array[4:])
print ("The last element: ", array[-1])

Originally:  [0 1 2 3 4 5 6 7 8 9]
First four elements:  [0 1 2 3]
After the first four elements:  [4 5 6 7 8 9]
The last element:  9


In [None]:
#Can index/slice multidimensional arrays, too.
array = np.array([[1,2,3],[4,5,6]])
print ("Originally: ", array)
print ("First row only: ", array[0])
print ("First column only: ", array[:,0])

Originally:  [[1 2 3]
 [4 5 6]]
First row only:  [1 2 3]
First column only:  [1 4]


In [None]:
#Range subindexing
# i:j:k where i is the starting index, j is the stopping index, and k is the step (k != 0).
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x[1:7:2])

[1 3 5]


In [None]:
#Transpose a matrix and verify by checking that the shape has changed
x = np.array([[11,22,33],[99,88,77]])
print(x.shape)
x = x.T
print(x.shape)
print(x)

(2, 3)
(3, 2)
[[11 99]
 [22 88]
 [33 77]]


In [None]:
#Invert a matrix
x = np.array([[11,22],[99,88]])
xI = np.linalg.inv(x)
print(xI)

[[-0.07272727  0.01818182]
 [ 0.08181818 -0.00909091]]


In [None]:
#Matrix multiply two arrays using @
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html#numpy.matmul
x = np.array([[11,22],[99,88]])
xI = np.linalg.inv(x)
identityMatrix = x@xI
print(identityMatrix)
print(np.round(identityMatrix,decimals=2))

[[ 1.00000000e+00  1.04083409e-17]
 [-1.11022302e-15  1.00000000e+00]]
[[ 1.  0.]
 [-0.  1.]]
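
The tiny off-diagonal values above are floating-point rounding error, so comparing the product to the identity with `==` would fail. `np.allclose` compares within a tolerance instead. The sketch below also contrasts `@` (matrix product) with `*` (element-wise product) on a small made-up matrix `a`:

```python
import numpy as np

x = np.array([[11, 22], [99, 88]])
x_inv = np.linalg.inv(x)

# Compare with the 2x2 identity matrix, allowing for rounding error
is_identity = np.allclose(x @ x_inv, np.eye(2))
print(is_identity)  # True

a = np.array([[1, 2], [3, 4]])
print(a * a)  # element-wise: [[1, 4], [9, 16]]
print(a @ a)  # matrix product: [[7, 10], [15, 22]]
```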


In [None]:
#Show type
type(identityMatrix)

numpy.ndarray

In [None]:
#Convert list to ndarray or matrix (NumPy no longer recommends np.matrix; prefer ndarray)
someList = [1,0,1]
print(type(someList))
somendarray = np.asarray(someList)
print(type(somendarray))
somendmatrix = np.asmatrix(someList)
print(type(somendmatrix))

<class 'list'>
<class 'numpy.ndarray'>
<class 'numpy.matrix'>


## Pandas library

Python library that relies on NumPy to provide high-performance, easy-to-use data structures and data analysis tools.

The primary tabular data structure is called a dataframe (df).

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
import pandas as pd
# Data source: https://www.kaggle.com/ankurchavda/coffee-beans-reviews-by-coffee-quality-institute
# Load data from public github repo (csv must be <25mb)
url = 'https://raw.githubusercontent.com/MIE223-2024/course-datasets/main/arabica_data.csv'
df = pd.read_csv(url, index_col=0)
#Print first 5 rows of dataframe
df.head()

Unnamed: 0,Acidity,Aftertaste,Aroma,Bag Weight,Balance,Body,Category.One.Defects,Category.Two.Defects,Clean Cup,Color,...,Producer,Region,Species,Sweetness,Uniformity,Variety,altitude_high_meters,altitude_low_meters,altitude_mean_meters,quality_score
0,8.75,8.67,8.67,60 kg,8.42,8.5,0,0,10.0,Green,...,METAD PLC,guji-hambela,Arabica,10.0,10.0,,2200.0,1950.0,2075.0,90.58
1,8.58,8.5,8.75,60 kg,8.42,8.42,0,1,10.0,Green,...,METAD PLC,guji-hambela,Arabica,10.0,10.0,Other,2200.0,1950.0,2075.0,89.92
2,8.42,8.42,8.42,1,8.42,8.33,0,0,10.0,,...,,,Arabica,10.0,10.0,Bourbon,1800.0,1600.0,1700.0,89.75
3,8.42,8.42,8.17,60 kg,8.25,8.5,0,2,10.0,Green,...,Yidnekachew Dabessa Coffee Plantation,oromia,Arabica,10.0,10.0,,2200.0,1800.0,2000.0,89.0
4,8.5,8.25,8.25,60 kg,8.33,8.42,0,2,10.0,Green,...,METAD PLC,guji-hambela,Arabica,10.0,10.0,Other,2200.0,1950.0,2075.0,88.83


In [None]:
#Prints the first 5 and last 5 rows, can be unwieldy/not useful
df

Unnamed: 0,Acidity,Aftertaste,Aroma,Bag Weight,Balance,Body,Category.One.Defects,Category.Two.Defects,Clean Cup,Color,...,Producer,Region,Species,Sweetness,Uniformity,Variety,altitude_high_meters,altitude_low_meters,altitude_mean_meters,quality_score
0,8.75,8.67,8.67,60 kg,8.42,8.50,0,0,10.00,Green,...,METAD PLC,guji-hambela,Arabica,10.00,10.00,,2200.00,1950.00,2075.00,90.58
1,8.58,8.50,8.75,60 kg,8.42,8.42,0,1,10.00,Green,...,METAD PLC,guji-hambela,Arabica,10.00,10.00,Other,2200.00,1950.00,2075.00,89.92
2,8.42,8.42,8.42,1,8.42,8.33,0,0,10.00,,...,,,Arabica,10.00,10.00,Bourbon,1800.00,1600.00,1700.00,89.75
3,8.42,8.42,8.17,60 kg,8.25,8.50,0,2,10.00,Green,...,Yidnekachew Dabessa Coffee Plantation,oromia,Arabica,10.00,10.00,,2200.00,1800.00,2000.00,89.00
4,8.50,8.25,8.25,60 kg,8.33,8.42,0,2,10.00,Green,...,METAD PLC,guji-hambela,Arabica,10.00,10.00,Other,2200.00,1950.00,2075.00,88.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1306,6.50,6.17,7.00,1 kg,6.17,6.67,0,4,0.00,Green,...,Omar Acosta,marcala,Arabica,8.00,8.00,Catuai,1450.00,1450.00,1450.00,68.33
1307,7.42,6.25,7.08,2 kg,6.75,7.25,0,20,6.00,,...,JUAN CARLOS GARCÍA LOPEZ,juchique de ferrer,Arabica,10.00,10.00,Bourbon,900.00,900.00,900.00,67.92
1308,6.67,6.42,6.75,69 kg,6.67,7.08,8,16,6.00,Blue-Green,...,COEB Koperativ Ekselsyo Basen,"department d'artibonite , haiti",Arabica,6.00,9.33,Typica,350.00,350.00,350.00,63.08
1309,6.25,6.33,7.25,1 kg,6.08,6.42,1,5,1.33,Green,...,Teófilo Narváez,jalapa,Arabica,6.00,6.00,Caturra,1100.00,1100.00,1100.00,59.83


In [None]:
#Summary statistics
df.describe()

Unnamed: 0,Acidity,Aftertaste,Aroma,Balance,Body,Category.One.Defects,Category.Two.Defects,Clean Cup,Cupper Points,Flavor,Moisture,Number of Bags,Sweetness,Uniformity,altitude_high_meters,altitude_low_meters,altitude_mean_meters,quality_score
count,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1311.0,1084.0,1084.0,1084.0,1311.0
mean,7.538764,7.403158,7.569527,7.523288,7.523387,0.450038,3.62624,9.83312,7.502441,7.523539,0.088963,153.678108,9.9109,9.839497,1808.751552,1759.456703,1784.104128,82.148825
std,0.319773,0.349945,0.31593,0.349174,0.293089,2.017571,5.482857,0.77135,0.428989,0.341817,0.047907,129.760079,0.454824,0.491508,8767.19233,8767.851565,8767.021485,2.893505
min,5.25,6.17,5.08,6.08,5.25,0.0,0.0,0.0,5.17,6.08,0.0,0.0,1.33,6.0,1.0,1.0,1.0,43.13
25%,7.33,7.25,7.42,7.33,7.33,0.0,0.0,10.0,7.25,7.33,0.09,14.0,10.0,10.0,1100.0,1100.0,1100.0,81.17
50%,7.5,7.42,7.58,7.5,7.5,0.0,2.0,10.0,7.5,7.58,0.11,170.0,10.0,10.0,1350.0,1310.64,1310.64,82.5
75%,7.75,7.58,7.75,7.75,7.67,0.0,4.0,10.0,7.75,7.75,0.12,275.0,10.0,10.0,1650.0,1600.0,1600.0,83.67
max,8.75,8.67,8.75,8.75,8.58,31.0,55.0,10.0,10.0,8.83,0.28,1062.0,10.0,10.0,190164.0,190164.0,190164.0,90.58


In [None]:
#Focus on specific column, similar to accessing a dictionary key
df['Species']

0       Arabica
1       Arabica
2       Arabica
3       Arabica
4       Arabica
         ...   
1306    Arabica
1307    Arabica
1308    Arabica
1309    Arabica
1310    Arabica
Name: Species, Length: 1311, dtype: object

Using the above method of column access on its own returns a Series object - think of this as a DataFrame with only one column.

If you want the raw values instead, add .values after the column selection; this returns them as a NumPy array.

Then by putting the object in a Set (which does not allow duplicate entries), we can quickly see all of the possible values for any column:

In [None]:
set(df['Variety'].values)

{'Arusha',
 'Blue Mountain',
 'Bourbon',
 'Catimor',
 'Catuai',
 'Caturra',
 'Ethiopian Heirlooms',
 'Ethiopian Yirgacheffe',
 'Gesha',
 'Hawaiian Kona',
 'Java',
 'Mandheling',
 'Marigojipe',
 'Moka Peaberry',
 'Mundo Novo',
 'Other',
 'Pacamara',
 'Pacas',
 'Pache Comun',
 'Peaberry',
 'Ruiru 11',
 'SL14',
 'SL28',
 'SL34',
 'Sulawesi',
 'Sumatra',
 'Sumatra Lintong',
 'Typica',
 'Yellow Bourbon',
 nan}

Alternatively, you can also call unique() to get all unique values in a column

In [None]:
df['Variety'].unique()

array([nan, 'Other', 'Bourbon', 'Catimor', 'Ethiopian Yirgacheffe',
       'Caturra', 'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
       'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai', 'Pacamara', 'Typica',
       'Sumatra Lintong', 'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
       'Mandheling', 'Ruiru 11', 'Arusha', 'Ethiopian Heirlooms',
       'Moka Peaberry', 'Sulawesi', 'Blue Mountain', 'Marigojipe',
       'Pache Comun'], dtype=object)

Notice the "nan" entry among the values, which in Pandas denotes a missing entry.

When working with real world datasets it's very common for entries to be missing, and there are a variety of ways of approaching a problem like this.

For now, though, we are simply going to tell Pandas to drop any row that has a missing value in any column, using the dropna() method.

In [None]:
df_clean = df.dropna()

As you perform this analysis, you will probably notice that we've lost quite a bit of our original data by simply dropping the nan values.
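
To see where the losses come from, it can help to count missing values per column, and to drop rows based only on the columns that matter for a given question. The sketch below uses a small made-up DataFrame `toy` (not the coffee dataset) with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table with one missing value in each column
toy = pd.DataFrame({
    "Variety": ["Bourbon", None, "Typica"],
    "altitude": [1700.0, 2000.0, np.nan],
})

missing_per_column = toy.isna().sum()  # number of missing values per column
print(missing_per_column)

rows_any = len(toy.dropna())                        # drop a row if any column is missing
rows_subset = len(toy.dropna(subset=["altitude"]))  # only require altitude
print(rows_any, rows_subset)  # 1 2
```

Restricting `dropna()` with `subset` keeps rows that are complete in the columns being analysed, even if other columns have gaps.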

There is another approach that we can examine, however. Instead of dropping the missing entries entirely, we can impute their value using the data we do have.

For a single column we can do this like so:

In [None]:
avg_altitude_mean_meters = df['altitude_mean_meters'].mean()
avg_altitude_mean_meters

1784.104127675277

In [None]:
df['altitude_mean_meters_imputed'] = df['altitude_mean_meters'].fillna(avg_altitude_mean_meters)

In [None]:
df[['altitude_mean_meters','altitude_mean_meters_imputed']].head(10)

Unnamed: 0,altitude_mean_meters,altitude_mean_meters_imputed
0,2075.0,2075.0
1,2075.0,2075.0
2,1700.0,1700.0
3,2000.0,2000.0
4,2075.0,2075.0
5,,1784.104128
6,,1784.104128
7,1635.0,1635.0
8,1635.0,1635.0
9,1822.5,1822.5


Now we have replaced the NaN values with the average altitude. While this obviously isn't as good as the original data, in a lot of situations it can be a step up from losing rows entirely.

Sophisticated analysis can be done in only a few lines using Pandas. Let's say that we want to get the average coffee rating by country. First, we can use the groupby method to automatically collect the results by country. Then, we can select the column we want - quality_score - and calculate its mean the same way we would using NumPy:

In [None]:
df_clean.groupby('Country of Origin')['quality_score'].mean()

Country of Origin
Brazil                          82.330725
China                           80.868000
Colombia                        82.932000
Costa Rica                      83.090000
El Salvador                     82.804545
Ethiopia                        87.792500
Guatemala                       81.957832
Haiti                           80.750000
Honduras                        81.010476
Indonesia                       81.524286
Kenya                           85.415000
Laos                            82.000000
Malawi                          81.711818
Mexico                          80.246087
Myanmar                         80.666667
Nicaragua                       79.333000
Panama                          81.750000
Peru                            77.000000
Philippines                     80.312500
Taiwan                          82.462895
Tanzania, United Republic Of    82.411724
Uganda                          83.778333
Name: quality_score, dtype: float64
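
A table of means does not show how many coffees each mean is based on, and a mean computed from one or two entries is less reliable than one computed from dozens. As a small sketch on made-up data (the `toy` frame below is an assumption for illustration), the agg method can compute several statistics per group at once:

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'Country of Origin': ['Peru', 'Kenya', 'Kenya', 'Brazil', 'Brazil', 'Brazil'],
    'quality_score': [77.0, 85.0, 86.0, 82.0, 83.0, 81.0],
})

# One row per country, one column per statistic
summary = toy.groupby('Country of Origin')['quality_score'].agg(['mean', 'count'])
print(summary)
```

The same call should work on df_clean if you want to check the group sizes behind the means above.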

This is certainly interesting, but it could be presented better. All of the ratings are pretty high (what's the highest and lowest rating?), so the differences between countries are hard to see. Let's standardize the country means to zero mean and unit variance so that we can tell them apart more easily. We'll just do that on our subset here for now, but you can apply it to the entire dataset too!

In [None]:
country_means = df_clean.groupby('Country of Origin')['quality_score'].mean()
mu, si = country_means.mean(), country_means.std() # Calculate the mean and standard deviation of the per-country means
country_means -= mu #Subtract the mean from every entry
country_means /= si #Divide every entry by the standard deviation
country_means

Country of Origin
Brazil                          0.194625
China                          -0.491541
Colombia                        0.476684
Costa Rica                      0.550802
El Salvador                     0.416895
Ethiopia                        2.756749
Guatemala                       0.019701
Haiti                          -0.546895
Honduras                       -0.424705
Indonesia                      -0.183677
Kenya                           1.641462
Laos                            0.039482
Malawi                         -0.095705
Mexico                         -0.783281
Myanmar                        -0.585987
Nicaragua                      -1.211611
Panama                         -0.077794
Peru                           -2.306024
Philippines                    -0.752127
Taiwan                          0.256626
Tanzania, United Republic Of    0.232622
Uganda                          0.873700
Name: quality_score, dtype: float64

In [None]:
country_means.sort_values()

Country of Origin
Peru                           -2.306024
Nicaragua                      -1.211611
Mexico                         -0.783281
Philippines                    -0.752127
Myanmar                        -0.585987
Haiti                          -0.546895
China                          -0.491541
Honduras                       -0.424705
Indonesia                      -0.183677
Malawi                         -0.095705
Panama                         -0.077794
Guatemala                       0.019701
Laos                            0.039482
Brazil                          0.194625
Tanzania, United Republic Of    0.232622
Taiwan                          0.256626
El Salvador                     0.416895
Colombia                        0.476684
Costa Rica                      0.550802
Uganda                          0.873700
Kenya                           1.641462
Ethiopia                        2.756749
Name: quality_score, dtype: float64
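
As a quick check of the standardization step, here is a minimal sketch on a made-up Series (the labels and values are illustrative assumptions). It computes the z-scores in a single expression and confirms that the result has mean 0 and standard deviation 1; it also shows one way to look up the lowest and highest entries:

```python
import pandas as pd

# Made-up scores, for illustration only
scores = pd.Series({'A': 80.0, 'B': 82.0, 'C': 84.0, 'D': 90.0})

# z-score: subtract the mean, divide by the standard deviation
# (pandas uses the sample standard deviation, ddof=1, by default)
z = (scores - scores.mean()) / scores.std()
print(z)
print(z.mean(), z.std())  # approximately 0 and 1

# Label and value of the lowest and highest entries
print(scores.idxmin(), scores.min())
print(scores.idxmax(), scores.max())
```

Unlike the in-place -= and /= used above, this version leaves the original Series untouched, so the raw scores remain available for comparison.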

Finally, we'll look at indexing using Pandas. Let's say that we want to only look at the coffee entries from Brazil. We can use the following syntax to identify those rows:

In [None]:
df_clean[df_clean['Country of Origin'] == 'Brazil']

Unnamed: 0,Acidity,Aftertaste,Aroma,Bag Weight,Balance,Body,Category.One.Defects,Category.Two.Defects,Clean Cup,Color,...,Producer,Region,Species,Sweetness,Uniformity,Variety,altitude_high_meters,altitude_low_meters,altitude_mean_meters,quality_score
225,7.67,7.75,7.58,2 kg,8.00,7.67,0,3,10.00,Green,...,Café do Paraíso,"minas gerais, br",Arabica,10.0,10.00,Mundo Novo,1183.00,894.00,1038.50,84.17
251,7.25,7.67,7.58,60 kg,7.75,8.00,0,0,10.00,Green,...,Ipanema Coffees,south of minas,Arabica,10.0,10.00,Yellow Bourbon,1260.00,1260.00,1260.00,84.00
258,7.50,7.58,7.50,60 kg,7.50,7.58,0,3,10.00,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.00,Bourbon,890.00,890.00,890.00,83.92
261,7.75,7.67,7.58,60 kg,7.83,7.58,0,7,10.00,Green,...,Ipanema Agricola,south of minas,Arabica,10.0,10.00,Bourbon,934.00,934.00,934.00,83.92
279,8.00,7.42,7.25,2 kg,7.83,7.75,0,0,10.00,Green,...,Ipanema Agrícola SA,south of minas,Arabica,10.0,10.00,Yellow Bourbon,1.00,1.00,1.00,83.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1063,7.00,7.33,7.42,2 kg,7.25,7.08,0,3,10.00,Green,...,Ipanema Agrícola SA,south of minas,Arabica,10.0,10.00,Bourbon,1200.00,1200.00,1200.00,80.58
1072,7.50,7.17,7.25,60 kg,7.25,7.33,0,3,10.00,Green,...,Lindolpho de Carvalho Dias,vale da grama,Arabica,10.0,9.33,Yellow Bourbon,1250.00,1250.00,1250.00,80.50
1075,7.33,7.08,7.42,60 kg,7.50,7.67,0,10,9.33,,...,MARIA ROGERIA COSTA PEREIRA,mantiqueira de minas,Arabica,10.0,9.33,Yellow Bourbon,1250.00,1250.00,1250.00,80.50
1094,7.08,7.42,7.17,59 kg,7.00,7.00,0,1,10.00,Green,...,Luiz Augusto Pereira Moguilod,cerrado - monte carmelo - minas gerais,Arabica,10.0,10.00,Catuai,995.00,995.00,995.00,80.25


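Say that out of the Brazilian coffees, we only want to look at those which are the Bourbon variety. We can combine both conditions into a single boolean mask using the & operator, with each condition wrapped in its own parentheses. (Chaining two bracket filters one after the other returns the same rows here, but Pandas emits a warning because the second mask has to be reindexed to match the already filtered DataFrame.)

In [None]:
df_clean[(df_clean['Country of Origin'] == 'Brazil') & (df_clean['Variety'] == 'Bourbon')]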


Unnamed: 0,Acidity,Aftertaste,Aroma,Bag Weight,Balance,Body,Category.One.Defects,Category.Two.Defects,Clean Cup,Color,...,Producer,Region,Species,Sweetness,Uniformity,Variety,altitude_high_meters,altitude_low_meters,altitude_mean_meters,quality_score
258,7.5,7.58,7.5,60 kg,7.5,7.58,0,3,10.0,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.0,Bourbon,890.0,890.0,890.0,83.92
261,7.75,7.67,7.58,60 kg,7.83,7.58,0,7,10.0,Green,...,Ipanema Agricola,south of minas,Arabica,10.0,10.0,Bourbon,934.0,934.0,934.0,83.92
282,7.5,7.58,7.67,2 kg,8.17,7.83,0,1,10.0,Green,...,Ipanema Agrícola SA,south of minas,Arabica,9.33,10.0,Bourbon,1200.0,1200.0,1200.0,83.83
296,7.75,7.58,7.67,60 kg,7.67,7.67,0,5,10.0,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.0,Bourbon,890.0,890.0,890.0,83.75
314,7.67,7.5,7.67,60 kg,7.5,7.67,0,2,10.0,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.0,Bourbon,890.0,890.0,890.0,83.67
318,7.75,7.58,7.58,60 kg,7.58,7.58,0,3,10.0,Green,...,Ipanema Agricola,south of minas,Arabica,10.0,10.0,Bourbon,934.0,934.0,934.0,83.67
335,7.58,7.58,7.58,60 kg,7.58,7.67,0,5,10.0,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.0,Bourbon,890.0,890.0,890.0,83.58
441,7.58,7.5,7.67,60 kg,7.58,7.67,0,4,10.0,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.0,Bourbon,890.0,890.0,890.0,83.17
485,7.42,7.67,7.58,2 kg,7.5,7.42,0,1,10.0,Green,...,Ipanema Agrícola SA,south of minas,Arabica,10.0,10.0,Bourbon,1200.0,1200.0,1200.0,83.08
497,7.75,7.5,7.58,60 kg,7.5,7.67,0,4,10.0,Green,...,Ipanema Agricola S.A,south of minas,Arabica,10.0,10.0,Bourbon,890.0,890.0,890.0,83.0
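
Filters like this can be written in a few equivalent ways. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration) to show named boolean masks combined inside .loc, which can also pick columns at the same time, and the query method, where column names containing spaces are wrapped in backticks:

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'Country of Origin': ['Brazil', 'Brazil', 'Kenya', 'Brazil'],
    'Variety': ['Bourbon', 'Catuai', 'SL28', 'Bourbon'],
    'quality_score': [83.9, 80.2, 85.4, 83.0],
})

# Name each condition, then combine them with & (and), | (or), ~ (not)
is_brazil = toy['Country of Origin'] == 'Brazil'
is_bourbon = toy['Variety'] == 'Bourbon'

# .loc takes a row mask and, optionally, a list of columns to keep
subset = toy.loc[is_brazil & is_bourbon, ['Variety', 'quality_score']]
print(subset)

# The same row filter written as a query string
same = toy.query("`Country of Origin` == 'Brazil' and Variety == 'Bourbon'")
print(same)
```

Naming the masks tends to keep longer filters readable, and it makes it easy to reuse one condition in several places.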


There are a lot of useful functions in Pandas that we do not have enough time to cover. We strongly encourage you to read the documentation for reference.

https://pandas.pydata.org/docs/