# DS3000 Foundations of Data Science
## Dialogue of Civilizations
May 13, 2025

Content:
- Canvas/Admin/Syllabus Review
- Python Basics
- Modules needed today: `pip install seaborn numpy pandas` (you may not *need* all of these, but why not?). Note that `collections` is part of Python's standard library, so it does not need to be installed.
    - Some of this will be review from the **Pre-Course Notebook** (see Canvas) and we may go through it pretty quickly
        - Assert
        - numpy & arrays
        - pandas
            - series
            - dataframe

Planned Time: ~1 hour

Next Thing: Python Lab (Lab 1)

# Admin/Syllabus

Take a look at the course Canvas page ([link](https://northeastern.instructure.com/courses/221176)), the tentative schedule and syllabus together.    

## Important Python Basics
Some of the following will be similar in content to the [Pre-Course Notebook](https://northeastern.instructure.com/courses/221176/files/34723234?module_item_id=12064061), which contains a summary of the basic Python tools you should know when starting this course. If you have not seen Python before, we'll have our first lab to practice it in a bit, but I **strongly** recommend you work through the Pre-Course Notebook for practice tonight.

To start, everyone should have some concept of what a dictionary is. Let's begin by proving you do. Create a dictionary in the empty cell below which:
- has three keys, "Key A", "Key B", "Key C"
- each value is a list:
    - the list for "Key A" is the numeric list [1, 2, 3, 4, 5]
    - the list for "Key B" is a list of **five** names
    - the list for "Key C" is a list of **five** boolean values (True/False)

In [None]:
# named my_dict (not dict) so the built-in dict type is not overwritten
my_dict = {"Key A": [1, 2, 3, 4, 5], "Key B": ['Eric', 'Mohit', 'Chieh', 'Laney', 'Mark'], "Key C": [True, False, True, False, True]}
my_dict


{'Key A': [1, 2, 3, 4, 5],
 'Key B': ['Eric', 'Mohit', 'Chieh', 'Laney', 'Mark'],
 'Key C': [True, False, True, False, True]}

In [2]:
# calling the list from Key A
my_dict["Key A"]

[1, 2, 3, 4, 5]

### Question: Can you look up a key in a dictionary based on its value?

Yes! There are a couple of ways, but perhaps the simplest is to combine `list()`, `.keys()`, `.values()`, and `.index()` in one line (you could also break this up into parts). Note that `.index()` returns only the first match:

In [3]:
dict_a = {'first': 10, 'second': ["a", "list"], 'third': True}

# for those who may need a second to process the below:
# take the keys of dict_a and make them into a list
# take the values of dict_a and make them into a list
# get the index of the value that corresponds to True
# use square brackets to get the key corresponding to that index
list(dict_a.keys())[list(dict_a.values()).index(True)]

'third'

In [4]:
# or:
list(dict_a.keys())[list(dict_a.values()).index(10)]

'first'
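The one-liner above stops at the first matching value. Here is a minimal sketch of an alternative, assuming you want every key whose value matches. The helper name `keys_for_value` and the extra `'fourth'` entry are illustrative additions, not part of the course materials:

```python
def keys_for_value(d, target):
    """Return a list of every key in d whose value equals target."""
    # .items() yields (key, value) pairs, so no index bookkeeping is needed
    return [key for key, value in d.items() if value == target]

# same dictionary as above, plus a hypothetical repeated value
dict_a = {'first': 10, 'second': ["a", "list"], 'third': True, 'fourth': 10}

matches = keys_for_value(dict_a, 10)
print(matches)
```

If no value matches, this returns an empty list, whereas `.index()` raises a `ValueError`.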

You should also have some basic knowledge of what a loop (especially a for loop) is. Below, let's practice an important aspect of working with loops.

### Iterating through a list (or a tuple):
We often want to operate on all the items in some collection of items (list, tuple, strings, dictionaries, etc.):

In [5]:
# iterate through the list and square everything, saving to a new list
# the numbers we want to square
some_list = [4,8,-2,6]
# an empty list to put the squared values in
another_list = list()

#the for loop that does it
for item in some_list:
    another_list.append(item ** 2)
    
another_list

[16, 64, 4, 36]

(++) Need both the index and the item?  Try [enumerate](https://docs.python.org/3.8/library/functions.html#enumerate)

In [6]:
# if you need an index too, use enumerate
for idx, item in enumerate(some_list):
    print(idx, item)

0 4
1 8
2 -2
3 6


Now, practice using the same strategy to loop through a dictionary (say, the dictionary from the first practice) and print out each key and value pair.

In [None]:
# do it here

Key A
[1, 2, 3, 4, 5]
Key B
['Eric', 'Mohit', 'Chieh', 'Laney', 'Mark']
Key C
[True, False, True, False, True]
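One way to approach this is sketched below on a small made-up dictionary (the name `scores` and its contents are illustrative only). The `.items()` method yields each key together with its value, much like `enumerate` yields an index together with each item:

```python
scores = {"quiz0": [80, 87], "quiz1": [90, 92]}

pairs = []
for key, value in scores.items():
    # each pass through the loop unpacks one (key, value) pair
    print(key)
    print(value)
    pairs.append((key, value))
```

Looping over a dictionary directly (`for key in scores:`) gives only the keys, which is why `.items()` is used here.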


## Functions in a bit more detail

In the pre-course notebook (or DS 2000/2500), you should have seen:

* defining and calling functions
* functions with multiple inputs
* functions with multiple outputs (tuple unpacking to the rescue!)

In this lesson, we will focus on making sure you understand the usage of:

* assertion

## Assert (function behavior)

We may want to check that the inputs to a function are appropriate.

In [8]:
def alpha_sort_list(list_in):
    """ sorts a list, alphabetically (regardless of case)
    
    Args:
        list_in (list): list of strings
        
    Return:
        list_out (list): list of strings, alpha sorted
    """
    
    return sorted(list_in)

Does the function above work?  It seems to work at first glance ... to be sure, let's build a little set of test cases and ensure it works as expected.

A test case is a set of inputs and outputs to a function with our intended behavior.

In [9]:
# assert alpha_sort_list(['Eliana', 'Callum', 'Bruno']) == ['Bruno', 'Callum', 'Eliana']
# assert alpha_sort_list(['Eliana', 'callum', 'Bruno']) == ['Bruno', 'callum', 'Eliana']
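The second test case fails with `alpha_sort_list` as written, because `sorted()` compares strings character by character and every uppercase letter sorts before every lowercase one. Here is a minimal sketch of one possible fix, using the `key` argument of `sorted()`. The name `alpha_sort_list_ci` is hypothetical:

```python
def alpha_sort_list_ci(list_in):
    """Sort a list of strings alphabetically, ignoring case."""
    # str.lower is applied to each item only for the comparison;
    # the items themselves come back unchanged
    return sorted(list_in, key=str.lower)

# plain sorted() puts the capitalized names first
assert sorted(['Eliana', 'callum', 'Bruno']) == ['Bruno', 'Eliana', 'callum']

# the case-insensitive version gives the intended order
assert alpha_sort_list_ci(['Eliana', 'callum', 'Bruno']) == ['Bruno', 'callum', 'Eliana']
```

When an `assert` passes, nothing is printed. When it fails, Python stops with an `AssertionError`, which is what makes it handy for quick test cases.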

In the cell below, write a function that takes two inputs (a and b) and returns their product. Make sure your function passes the assert provided below. (Notice, this is much simpler than most of the functions you will be writing...)

In [None]:
# write your function here

assert my_func(a = 2, b = 4) == 8

Python (if you've never used it before) relies on many libraries that must be imported in order to accomplish many tasks. For example, to take the square root of a number, you could either use base Python to raise a number to the power of $.5$:

In [11]:
2**(.5)

1.4142135623730951

Or, you could import the `numpy` or `math` modules, which can do that for you (plus, do a lot of other more useful things, as you will see):

In [12]:
# numpy is one of the most important python modules
import numpy as np
import math

print(np.sqrt(2))
print(math.sqrt(2))

1.4142135623730951
1.4142135623730951


In [13]:
# defaultdict is a dictionary that creates a default value for any missing key
from collections import defaultdict

# another example function
def create_dict(featurenames, features):
    """ creates a dictionary with keys and values corresponding to the inputs

    Args:
        featurenames (list): a list of strings that will serve as the keys to the dictionary
        features (list): a list of values (any type) that will serve as the values of the dictionary

    Returns:
        feature_dict (dictionary): the dictionary
    """

    # create an empty dictionary whose missing keys default to an empty list
    # (named feature_dict so the built-in dict type is not overwritten)
    feature_dict = defaultdict(list)
    
    try:
        for i in range(len(featurenames)):
            # for each feature name, create a key-value pair in the dictionary
            feature_dict[str(featurenames[i])] = features[i]

            # if age is a feature, create a logical column for an age 30 cut-off
            if featurenames[i] == "age":
                feature_dict["old"] = [age > 30 for age in features[i]]
        return feature_dict
    except Exception:
        # if something doesn't work above, let us know
        print("Failure to create dictionary; check inputs")

    

In [14]:
test = ["bloodtype", "age", "height (cm)", "weight (lbs)"] # this should not work, why? how do we fix it?
vals = [["A", "B", "AB", "AB+"], [28, 31, 16, 88], [180, 194, 136, 178]]

my_dict2 = create_dict(test, vals)
my_dict2

Failure to create dictionary; check inputs
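The `try`/`except` above tells us that something went wrong, but not what. Here is a minimal sketch, assuming we would rather stop early with a specific message, that uses `assert` to check the inputs before building the dictionary. The name `create_dict_checked` is hypothetical, and the age cut-off logic is left out for brevity:

```python
def create_dict_checked(featurenames, features):
    """Build a dictionary from matching lists of keys and values."""
    # one list of values per feature name, otherwise stop with a clear message
    assert len(featurenames) == len(features), \
        f"{len(featurenames)} feature names but {len(features)} features"
    return {name: values for name, values in zip(featurenames, features)}

names = ["bloodtype", "age", "height (cm)"]
vals = [["A", "B", "AB", "AB+"], [28, 31, 16, 88], [180, 194, 136, 178]]

checked = create_dict_checked(names, vals)
print(checked["age"])
```

Calling it with four names and three lists of values would raise an `AssertionError` whose message reports both counts.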


# Representing data as arrays (or, matrices, if you prefer, though "array" is more general)
It's often a convenient analogy to consider a dataset as a big table.  A dataset describes the **features** of a collection of **samples**:
- each row represents a sample (or, observation)
    - e.g. a country, or females of a certain age and education in a country
- each column represents a feature (or, variable)
    - e.g. how satisfied a country is, or how satisfied females of a certain age and education in a country are
- the intersection of a row and column contains the value of that feature for that sample
    - e.g. how satisfied Senior Males with Secondary education in Belgium are
 
The below data come from [eurostat](https://ec.europa.eu/eurostat/web/main/data/database) (which we'll be visiting!)
    
<img src="https://ec.europa.eu/eurostat/o/estat-theme-ecl/images/header/estat-logo-horizontal.svg?browserId=other&minifierType=js&languageId=en_GB&t=1715007920000" width=300 />


In [15]:
# (we'll cover this code later, for now I just want us all to
# look at a dataset together)
import pandas as pd

url = 'https://raw.githubusercontent.com/eaegerber/data/main/EU_life_sat.csv'
df_EUlife = pd.read_csv(url, encoding='unicode_escape')
belgium = df_EUlife['Country'] == 'Belgium'
belgium

0       True
1      False
2      False
3      False
4      False
       ...  
636    False
637    False
638    False
639    False
640    False
Name: Country, Length: 641, dtype: bool

In [16]:
df_EUlife

Unnamed: 0,Country,LifeSat,TrustInOthers,Sex,AgeClass,EducationClass
0,Belgium,7.8,6.0,Male,Youth,Primary
1,Bulgaria,7.0,5.1,Male,Youth,Primary
2,Czechia,8.3,5.5,Male,Youth,Primary
3,Denmark,7.7,5.0,Male,Youth,Primary
4,Germany,7.7,5.4,Male,Youth,Primary
...,...,...,...,...,...,...
636,Montenegro,6.9,6.8,Female,Senior,Tertiary
637,North Macedonia,6.3,5.4,Female,Senior,Tertiary
638,Albania,5.5,5.4,Female,Senior,Tertiary
639,Serbia,6.2,5.4,Female,Senior,Tertiary


Why represent data in 2d arrays?
- many datasets are well encapsulated as a 2d array with 
    - different rows used for samples
    - different columns used for features
- Arrays (matrices) are natural math objects in linear algebra, probability and statistics, all of which underpin machine learning.

## **NumPy** (**Numerical Python**) Library
* First appeared in 2006 and is the **preferred Python array implementation**.
* High-performance, richly functional **_n_-dimensional array** type called **`ndarray`**. 
* **Written in C** and **up to 100 times faster than lists**.
* Critical in big-data processing, AI applications and much more. 
* According to `libraries.io`, **over 450 Python libraries depend on NumPy**. 
* Many popular data science libraries such as Pandas, SciPy (Scientific Python) and Keras (for deep learning) are built on or depend on NumPy. 

Big Question:
```
What is an array/matrix?  (and how is it different from a list or list of lists?)
```

| Array                                 | List (Python: Dynamic Array)                         |
|---------------------------------------|------------------------------------------------------|
| Size is static (contiguous memory)    | Size can be modified quickly (non-contiguous memory) |
| Quick to compute (esp Linear Algebra) | Slower to compute (and clumsy looking code)          |
| contains 1 datatype (numeric)         | may contain many data types (need not be numeric)    |
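To make the "quick to compute" row of the table concrete, here is a small sketch that repeats the squaring task from earlier with both a list and an array. The variable names are illustrative:

```python
import numpy as np

some_list = [4, 8, -2, 6]

# list: we visit every item ourselves (here with a list comprehension)
squared_list = [item ** 2 for item in some_list]

# array: the operation is applied to every element at once
some_array = np.array(some_list)
squared_array = some_array ** 2

print(squared_list)
print(squared_array)
```

Note that `some_list ** 2` raises a `TypeError`, because lists do not support elementwise arithmetic.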

### Initializing arrays:
- 1d from list / tuple
- 2d from list / tuple

In [17]:
import numpy as np

# x is a 1d array with shape (3,)
x = np.array((1, 2, 3))
x

array([1, 2, 3])

In [18]:
# y is a 2d array (2, 3)
y = np.array([[1, 2, 3],
              [4, 5, 6]])
y

array([[1, 2, 3],
       [4, 5, 6]])

### Building some special matrices
- zeros
- ones
- full 
- identity

#### Convention: Rows First!
- we describe array shape as `(n_rows, n_cols)`
- we index into an array as `x[row_idx, col_idx]`

In [19]:
# shape = (n_rows, n_cols)
# shape = (height, width)
# .zeros gives an array of all zeros
z = np.zeros((5, 2)) # tall array
z

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [20]:
# .ones gives an array of all ones
one_array = np.ones((2, 5), dtype=int)
one_array

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [21]:
# can use .full to create an array of all fill_value
# np.full(shape=(2,5), fill_value=2)
two_array = np.full(shape=(2, 5), fill_value=2.0)
two_array

array([[2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.]])

In [22]:
# identity matrix
# 1's on the diagonal, 0s elsewhere
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

### Array Attributes
- shape
- size
- ndim

NumPy can build arrays out of many different number types (bool, int, float).  ([see also](https://numpy.org/doc/stable/user/basics.types.html))
- dtype
    - astype
- nbytes

In [23]:
x = np.array([[1, 2, 3],
              [4, 5, 6]]) 

In [24]:
# whether you see int32 or int64 depends on your platform and NumPy version; it will not matter for our purposes
x.dtype

dtype('int32')

In [25]:
# ndim is the (n)umber of (dim)ensions
x.ndim

2

In [26]:
# shape gives the values of the dimensions
x.shape

(2, 3)

In [27]:
# size is total number of elements
x.size

6

In [28]:
x.nbytes

24
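The `astype` method listed above is not shown in the cells, so here is a minimal sketch. It returns a new array with the requested type and leaves the original untouched. The dtype is set explicitly to `int32` here (an assumption made for this example) so the byte counts are the same on every platform:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.int32)

# convert to 64-bit floats; x itself is not changed
x_float = x.astype(np.float64)

print(x.dtype, x.nbytes)              # 6 elements * 4 bytes each
print(x_float.dtype, x_float.nbytes)  # 6 elements * 8 bytes each
```

This also shows where `nbytes` comes from: the number of elements (`size`) times the bytes needed per element.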

## Manipulating array shape

### Diagonal

The diagonal of each array is shaded below; the unshaded elements are not on the diagonal of the matrix:

$$ \begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare\\
\square & \square & \square\\
\end{bmatrix} 
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square & \square & \square\\
\square & \blacksquare & \square& \square & \square\\
\square & \square & \blacksquare& \square & \square\end{bmatrix}
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare
\end{bmatrix} 
$$

### Numpy methods
- transpose
- `.reshape()`
    - order of reshape (row or column first?)

In [29]:
x = np.array([[1, 2, 3],
              [4, 5, 6]]) 
x

array([[1, 2, 3],
       [4, 5, 6]])

In [30]:
# transpose (.T): flip across the diagonal
y = x.T
y

array([[1, 4],
       [2, 5],
       [3, 6]])

In [31]:
# reshape allows us to change shape of matrix by defining the dimensions
x.reshape((1, 6))

array([[1, 2, 3, 4, 5, 6]])

In [32]:
# (new matrix must have same total number of elements)
# x.reshape((1, 8))
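The "row or column first?" question from the list above comes down to the `order` argument of `.reshape()`. Here is a minimal sketch using the same `x` as in the cells above; the result names are illustrative:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

# default order='C': elements are read and filled row by row
by_row = x.reshape((3, 2))

# order='F': elements are read and filled column by column
by_col = x.reshape((3, 2), order='F')

# -1 asks numpy to work out that dimension from the total size
inferred = x.reshape((-1, 2))

print(by_row)
print(by_col)
print(inferred.shape)
```

With the default order the values stay in reading order (1, 2, 3, ...), while `order='F'` walks down the columns of `x` and fills the columns of the result.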

## Array Indexing (slicing)

You can index arrays, using `start:stop:step` indexing!

In [33]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [34]:
x[3]

3

In [35]:
x[2:4]

array([2, 3])

In [36]:
x[-3:]

array([7, 8, 9])
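The cells above use `start` and `stop` but never the `step` part of `start:stop:step`. Here is a short sketch of how the third value behaves, with illustrative result names:

```python
import numpy as np

x = np.arange(10)

evens = x[::2]        # every second element, from the start
middle = x[1:8:3]     # start at 1, stop before 8, step by 3
backwards = x[::-1]   # a negative step walks through the array in reverse

print(evens)
print(middle)
print(backwards)
```

Any of the three parts can be left out: a missing `start` or `stop` means "from the beginning" or "to the end", and a missing `step` means 1.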

A two dimensional array requires two indices to get a value: `x[row_idx, col_idx]`

(Just like our convention for rows first in shape, the row index comes first as we index into the array)

In [37]:
x = np.arange(20).reshape((4, 5))
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [38]:
# row_idx=1 (second row since python starts counting at 0)
# col_idx=2 (third column since python starts counting at 0)
x[1, 2]

7

In [39]:
# we can start:stop:step slice either index

# get a slice of rows and a constant column
x[0:2, 2]

array([2, 7])

In [40]:
# get a slice of columns and a constant row
x[2, 0:2]

array([10, 11])
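A lone `:` selects everything along an axis, and slicing both axes at once returns a smaller 2d array. Here is a minimal sketch on the same 4 by 5 array; the result names are illustrative:

```python
import numpy as np

x = np.arange(20).reshape((4, 5))

row_1 = x[1, :]        # all columns of row 1
col_2 = x[:, 2]        # all rows of column 2
block = x[1:3, 0:2]    # rows 1-2 and columns 0-1, still 2d

print(row_1)
print(col_2)
print(block)
```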

# We will discuss various other NumPy things as they arise! Next up:

# Pandas

Pandas is a Python library for storing and manipulating tabular data.  

### If we already have `np.array()`, why do we need pandas?
- pandas supports non numeric data (strings for categorical data, for example)
- pandas supports reading / storing data from more formats
    - csv (spreadsheets)
- pandas more elegantly deals with missing data
- pandas handles indexing woes

You could do almost everything pandas does with numpy arrays ... but it would be much more difficult to accomplish.

### Pandas has two essential objects:
- **dataframe**
    - 2 dimensional data structure
    - you've already seen one today!  (we replicate below)
- **series (vectors)**
    - 1 dimensional data structure, each item associated with some index
    - you could store the weight of all the penguins as a series 
        - (all samples of one feature)
    - you could store the weight, bill size, sex, island, etc for a single penguin as a series
        - (all features for one sample)

In [41]:
# recall how we got our data from before
import pandas as pd

url = 'https://raw.githubusercontent.com/eaegerber/data/main/EU_life_sat.csv'
df_EUlife = pd.read_csv(url, encoding='unicode_escape')
df_EUlife.head()

Unnamed: 0,Country,LifeSat,TrustInOthers,Sex,AgeClass,EducationClass
0,Belgium,7.8,6.0,Male,Youth,Primary
1,Bulgaria,7.0,5.1,Male,Youth,Primary
2,Czechia,8.3,5.5,Male,Youth,Primary
3,Denmark,7.7,5.0,Male,Youth,Primary
4,Germany,7.7,5.4,Male,Youth,Primary


In [42]:
# the table above is a dataframe
type(df_EUlife)

pandas.core.frame.DataFrame

In [43]:
belgium = df_EUlife.Country == 'Belgium'
df_EUlife.loc[belgium,'Country':'TrustInOthers']

Unnamed: 0,Country,LifeSat,TrustInOthers
0,Belgium,7.8,6.0
37,Belgium,7.1,4.8
69,Belgium,7.2,5.9
106,Belgium,7.8,6.1
143,Belgium,7.0,4.7
178,Belgium,7.3,6.2
215,Belgium,8.1,6.5
252,Belgium,7.5,5.8
289,Belgium,7.7,6.3
326,Belgium,7.8,6.2


## Pandas Series
### building:
- building: default index
- building: custom index
- building: from a dict

In [44]:
# each row, or column is a series object
# this represents first row of dataframe
country0_series = df_EUlife.iloc[0, :]
country0_series

Country           Belgium
LifeSat               7.8
TrustInOthers         6.0
Sex                  Male
AgeClass            Youth
EducationClass    Primary
Name: 0, dtype: object

Pandas series contain a sequence of labelled data elements:
- country0's `Country` is `Belgium`
- country0's `LifeSat` is `7.8`
- country0's `<index-name>` is `<corresponding-value>`

A series is quite similar to a dictionary ...

In [45]:
country0_dict = {'Country': 'Belgium',
 'LifeSat': 7.8,
 'TrustInOthers': 6.0,
 'Sex': 'Male',
 'AgeClass': 'Youth',
 'EducationClass': 'Primary'}

In [46]:
# build a series from dict
country0_series = pd.Series(country0_dict)
country0_series

Country           Belgium
LifeSat               7.8
TrustInOthers         6.0
Sex                  Male
AgeClass            Youth
EducationClass    Primary
dtype: object

In [47]:
# you can also pass two corresponding lists / tuples
index = ['Country', 'LifeSat', 'TrustInOthers', 'Sex',
       'AgeClass', 'EducationClass']
values = ['Belgium', 7.8, 6.0, 'Male', 'Youth', 'Primary']

country0_series = pd.Series(values, index=index)
country0_series

Country           Belgium
LifeSat               7.8
TrustInOthers         6.0
Sex                  Male
AgeClass            Youth
EducationClass    Primary
dtype: object

In [48]:
# sometimes your data has no meaningful index
# pandas will default to indexing things with integers
ice_cream_flavors = 'vanilla', 'chocolate', 'cherry garcia', 'oatmeal'
pd.Series(ice_cream_flavors)

0          vanilla
1        chocolate
2    cherry garcia
3          oatmeal
dtype: object

In [49]:
# you can access values via .values
country0_series.values

array(['Belgium', 7.8, 6.0, 'Male', 'Youth', 'Primary'], dtype=object)

In [50]:
# you can access index via .index
country0_series.index

Index(['Country', 'LifeSat', 'TrustInOthers', 'Sex', 'AgeClass',
       'EducationClass'],
      dtype='object')
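Because each value in a series is tied to an index label, items can be pulled out by label much like a dictionary lookup, or by integer position. Here is a minimal sketch on a shortened version of the series above; the result names are illustrative:

```python
import pandas as pd

country0_series = pd.Series({'Country': 'Belgium',
                             'LifeSat': 7.8,
                             'TrustInOthers': 6.0})

# look up a single item by its index label
by_label = country0_series['LifeSat']

# look up a single item by integer position
by_position = country0_series.iloc[0]

# a list of labels gives back a smaller series
subset = country0_series[['Country', 'LifeSat']]

print(by_label)
print(by_position)
print(subset)
```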

## Pandas: DataFrame

Remember:
- `Series`:  1d data object
- `DataFrame`: 2d data object

`DataFrame`s represent two-dimensional data, for example, grades:

|           | Quiz 0 | Quiz 1 | Quiz 2 |
|-----------|--------|--------|--------|
| Student 0 | 80     | 90     | 50     |
| Student 1 | 87     | 92     | 80     |

Each column or row above could be considered a `Series` object (as we'll see later, we can indeed extract a single row or column of a dataframe as a `Series` object).

In [51]:
import pandas as pd
import numpy as np

quiz_array = np.array([[80, 90, 50],
                 [87, 92, 80]])

df_quiz = pd.DataFrame(quiz_array, 
                       columns=('quiz0', 'quiz1', 'quiz2'), 
                       index=('student0', 'student1'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [52]:
# we can also construct a dataframe from a dictionary
# keys of the dictionary are columns of dataframe
# values are lists (or tuples) of the values in each column
quiz_dict = {'quiz0': [80, 87],
            'quiz1': [90, 92],
            'quiz2': [50, 80]}
pd.DataFrame(quiz_dict, index=('student0', 'student1'))

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [53]:
# another way to construct, this time the transpose
quiz_dict2 = {'student0': [80, 90, 50],
             'student1': [87, 92, 80]}
pd.DataFrame(quiz_dict2, index=('quiz0', 'quiz1', 'quiz2'))

Unnamed: 0,student0,student1
quiz0,80,87
quiz1,90,92
quiz2,50,80


In [54]:
# could also use .transpose() or even .T
df_quiz.transpose()
# df_quiz.T

Unnamed: 0,student0,student1
quiz0,80,87
quiz1,90,92
quiz2,50,80


In [55]:
# we can also add the column or index names after creation
df_quiz = pd.DataFrame(quiz_array)
df_quiz

Unnamed: 0,0,1,2
0,80,90,50
1,87,92,80


In [56]:
df_quiz.columns = ['quiz0', 'quiz1', 'quiz2']
df_quiz.index = ('student0', 'student1')
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


## Indexing / Accessing a DataFrame
### Can be tricky sometimes!
- indexing: 
    - `.loc[]` indexing by name of row or column
    - `.iloc[]` indexing by position integer (0, 1, 2, 3, 4 ...)
    - slicing & subsets
- using `:` to get full rows or columns

In [57]:
belgium = df_EUlife.Country == 'Belgium'
df_EUlife.loc[belgium,'Country':'TrustInOthers']

Unnamed: 0,Country,LifeSat,TrustInOthers
0,Belgium,7.8,6.0
37,Belgium,7.1,4.8
69,Belgium,7.2,5.9
106,Belgium,7.8,6.1
143,Belgium,7.0,4.7
178,Belgium,7.3,6.2
215,Belgium,8.1,6.5
252,Belgium,7.5,5.8
289,Belgium,7.7,6.3
326,Belgium,7.8,6.2


In [58]:
df_EUlife.iloc[0,0:3]

Country          Belgium
LifeSat              7.8
TrustInOthers        6.0
Name: 0, dtype: object

In [59]:
# if you access directly into dataframe, it will assume you're looking for a column
# (below is equivalent to df_EUlife.loc[:, 'Country'])
df_EUlife['Country']

0              Belgium
1             Bulgaria
2              Czechia
3              Denmark
4              Germany
            ...       
636         Montenegro
637    North Macedonia
638            Albania
639             Serbia
640            Turkiye
Name: Country, Length: 641, dtype: object

In [60]:
# columns are also available as attributes (as long as the column name is a valid Python name, e.g. no spaces)
df_EUlife.Country

0              Belgium
1             Bulgaria
2              Czechia
3              Denmark
4              Germany
            ...       
636         Montenegro
637    North Macedonia
638            Albania
639             Serbia
640            Turkiye
Name: Country, Length: 641, dtype: object
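One detail that makes `.loc[]` and `.iloc[]` tricky is that their slices end differently. Here is a minimal sketch on the small quiz dataframe from earlier (rebuilt here so the example runs on its own); the result names are illustrative:

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))

# .loc uses labels, and a label slice INCLUDES its end point
by_name = df_quiz.loc['student0', 'quiz0':'quiz1']

# .iloc uses integer positions, and a position slice EXCLUDES its end point
by_position = df_quiz.iloc[0, 0:2]

# a lone : selects everything along that axis
whole_row = df_quiz.loc['student1', :]

print(by_name)
print(by_position)
print(whole_row)
```

Both `by_name` and `by_position` hold the same two values, even though one slice names `'quiz1'` as its end and the other stops before position 2.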

## Modifying a DataFrame
- updating values: single cell
- adding a new column
- `pd.concat()`
    - adds rows to a dataframe (a single row can be passed as a one-row dataframe)
    - replaces `DataFrame.append()`, which was deprecated in pandas 1.4 and removed in pandas 2.0

In [61]:
# setting single entry in dataframe
df_quiz.loc['student0', 'quiz1'] = 123
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,123,50
student1,87,92,80


In [62]:
# adding a new column (which student got which grade?)
# notice data frames can include columns of multiple types!
df_quiz['overall grade'] = 'a', 'b' 
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student0,80,123,50,a
student1,87,92,80,b


In [63]:
# delete a column
del df_quiz['overall grade']
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,123,50
student1,87,92,80


In [64]:
# adding a column (next 2 cells) in a way that is more robust to indexing errors:
# by explicitly labelling the index, each value is matched to the intended row
s_overgrade = pd.Series({'student1': 'b-',
                         #'student0': 'a+',
                        'student2': 'f (no quizzes taken)'})
s_overgrade

student1                      b-
student2    f (no quizzes taken)
dtype: object

In [65]:
# notice how pandas helps us out in aligning our new column with proper row
# (and avoids including student2)
df_quiz['overall grade'] = s_overgrade
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student0,80,123,50,
student1,87,92,80,b-


In [66]:
# how to 'drop' a row (returns a dataframe with row removed)
df_quiz_short = df_quiz.drop('student0')
df_quiz_short

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student1,87,92,80,b-


In [67]:
# rebuild df_quiz
quiz_dict = {'quiz0': [80, 87],
            'quiz1': [90, 92],
            'quiz2': [50, 80]}
df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [68]:
# notice: name of series ends up on index of dataframe
# notice: order of items in series doesn't matter, they're aligned by index
s_student3 = pd.Series({'quiz1': 90,
                        'quiz2': 100,
                        'quiz0': 95},
                      name='student3')
s_student3

quiz1     90
quiz2    100
quiz0     95
Name: student3, dtype: int64

In [69]:
# add new row to dataframe
pd.concat([df_quiz, s_student3.to_frame().T])

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80
student3,95,90,100


In [70]:
# also notice: pd.concat() returns a new dataframe; df_quiz itself isn't modified above
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [71]:
# thus, must overwrite:
df_quiz = pd.concat([df_quiz, s_student3.to_frame().T])
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80
student3,95,90,100


In [72]:
# adding a column that is a function of other columns:
df_quiz['average'] = (df_quiz['quiz0'] + df_quiz.quiz1 + df_quiz.quiz2)/3
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2,average
student0,80,90,50,73.333333
student1,87,92,80,86.333333
student3,95,90,100,95.0


In [73]:
df_quiz.average

student0    73.333333
student1    86.333333
student3    95.000000
Name: average, dtype: float64
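Adding the columns by hand works for three quizzes but gets tedious with more. Here is a sketch of an alternative, assuming the quiz columns are the ones to average, using the `.mean()` method with `axis=1`. The dataframe is rebuilt here so the example runs on its own:

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87, 95],
                        'quiz1': [90, 92, 90],
                        'quiz2': [50, 80, 100]},
                       index=('student0', 'student1', 'student3'))

# axis=1 averages across the columns, giving one value per row
quiz_cols = ['quiz0', 'quiz1', 'quiz2']
df_quiz['average'] = df_quiz[quiz_cols].mean(axis=1)

print(df_quiz)
```

Selecting the quiz columns explicitly keeps any other columns (such as a previously computed average) out of the calculation.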

### Boolean Indexing into DataFrame

Sometimes we want to grab only the rows or columns which meet a particular condition.

"Get all rows for the country Belgium"

In [74]:
belgium = df_EUlife.Country == 'Belgium'
df_EUlife.loc[belgium,'Country':'TrustInOthers']

Unnamed: 0,Country,LifeSat,TrustInOthers
0,Belgium,7.8,6.0
37,Belgium,7.1,4.8
69,Belgium,7.2,5.9
106,Belgium,7.8,6.1
143,Belgium,7.0,4.7
178,Belgium,7.3,6.2
215,Belgium,8.1,6.5
252,Belgium,7.5,5.8
289,Belgium,7.7,6.3
326,Belgium,7.8,6.2


Or, "Grab all rows where LifeSat is above 8"

In [None]:
# try it here


Unnamed: 0,Country,LifeSat,TrustInOthers,Sex,AgeClass,EducationClass
2,Czechia,8.3,5.5,Male,Youth,Primary
6,Ireland,8.9,7.9,Male,Youth,Primary
9,France,8.4,6.0,Male,Youth,Primary
10,Croatia,8.4,6.2,Male,Youth,Primary
19,Austria,8.6,6.2,Male,Youth,Primary
...,...,...,...,...,...,...
630,Finland,8.3,7.8,Female,Senior,Tertiary
631,Sweden,8.1,6.2,Female,Senior,Tertiary
632,Iceland,8.4,6.8,Female,Senior,Tertiary
633,Norway,8.2,7.0,Female,Senior,Tertiary


In [76]:
# we can build more complex conditions using 
# & (and operator)
# | (or operator)

# all students who got at least an 80 on quiz1 but scored less than 75 on quiz2
s_bool = (df_quiz.quiz1 >= 80) & (df_quiz.quiz2  < 75)
s_bool

student0     True
student1    False
student3    False
dtype: bool

In [77]:
df_quiz.loc[s_bool, :]

Unnamed: 0,quiz0,quiz1,quiz2,average
student0,80,90,50,73.333333
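The cell above uses `&`; here is a companion sketch for `|` (or) and `~` (not), with the quiz dataframe rebuilt so the example runs on its own. The conditions themselves are made up for illustration:

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87, 95],
                        'quiz1': [90, 92, 90],
                        'quiz2': [50, 80, 100]},
                       index=('student0', 'student1', 'student3'))

# | (or): below 85 on quiz0 OR at least 100 on quiz2
either = (df_quiz.quiz0 < 85) | (df_quiz.quiz2 >= 100)

# ~ (not): flips every True/False
neither = ~either

print(df_quiz.loc[either, :])
print(df_quiz.loc[neither, :])
```

The parentheses around each comparison are required, because `&` and `|` are evaluated before `<` and `>=` in Python.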


# Loading Data into Pandas
## More on this after the Data Lab

Data comes from many places:
- Web Scraping
- Application Program Interface (API)
- SQL
- local file:
    - csv
    - JSON
    - fixed width tables (HTML)
    
### Pandas functions which load data
| Function | Description
| ------ | :------
| **`read_csv`** | Load comma separated values data from a file or URL (other delimiters too!)
| **`read_excel`** | Read data from a Microsoft Excel file (xls or xlsx)
| **`read_fwf`** | Read data in fixed-width column format (i.e., columns aligned by position, with no delimiters)
| **`read_clipboard`** | Version of read_csv that reads data from the clipboard; useful for converting tables from web pages
| **`read_html`** | Read all tables contained in the given HTML document.
| **`read_json`** | Read data from a JSON (JavaScript Object Notation) string representation

## Reading local or non-local CSVs into Pandas
- read_csv
- index_col
- header

In [78]:
# recall how we got our data from before
import pandas as pd

# from a url (in this case, Dr. Gerber's GitHub)
url = 'https://raw.githubusercontent.com/eaegerber/data/main/EU_life_sat.csv'
df_EUlife = pd.read_csv(url, encoding='unicode_escape')
df_EUlife.head()

Unnamed: 0,Country,LifeSat,TrustInOthers,Sex,AgeClass,EducationClass
0,Belgium,7.8,6.0,Male,Youth,Primary
1,Bulgaria,7.0,5.1,Male,Youth,Primary
2,Czechia,8.3,5.5,Male,Youth,Primary
3,Denmark,7.7,5.0,Male,Youth,Primary
4,Germany,7.7,5.4,Male,Youth,Primary


In [79]:
# from a local file
# downloaded this .csv from Kaggle (https://www.kaggle.com/datasets/dipeshkhemani/airbnb-cleaned-europe-dataset?resource=download)
# it contains Airbnb stays in Europe with the type of stay, price, distance from the metro and city centre, and many other details
# note: file must be in the same folder as the jupyter notebook
df_EUABnB = pd.read_csv('Aemf1.csv')
df_EUABnB.head()

Unnamed: 0,Fake Host ID,City,Price,Day,Room Type,Shared Room,Private Room,Person Capacity,Superhost,Multiple Rooms,Business,Cleanliness Rating,Guest Satisfaction,Bedrooms,City Center (km),Metro Distance (km),Attraction Index,Normalised Attraction Index,Restraunt Index,Normalised Restraunt Index
0,22969,Amsterdam,194.033698,Weekday,Private room,False,True,2,False,1,0,10,93,1,5.022964,2.53938,78.690379,4.166708,98.253896,6.846473
1,13957,Amsterdam,344.245776,Weekday,Private room,False,True,4,False,0,0,8,85,1,0.488389,0.239404,631.176378,33.421209,837.280757,58.342928
2,4918,Amsterdam,264.101422,Weekday,Private room,False,True,2,False,0,1,9,87,1,5.748312,3.651621,75.275877,3.985908,95.386955,6.6467
3,20329,Amsterdam,433.529398,Weekday,Private room,False,True,4,False,0,1,9,90,2,0.384862,0.439876,493.272534,26.119108,875.033098,60.973565
4,38906,Amsterdam,485.552926,Weekday,Private room,False,True,2,True,0,0,10,98,1,0.544738,0.318693,552.830324,29.272733,815.30574,56.811677


In [80]:
# how to specify index col (make sure this is uniquely identifiable!)
df_EUABnB = pd.read_csv('Aemf1.csv', index_col='Fake Host ID')
df_EUABnB.head()

Unnamed: 0_level_0,City,Price,Day,Room Type,Shared Room,Private Room,Person Capacity,Superhost,Multiple Rooms,Business,Cleanliness Rating,Guest Satisfaction,Bedrooms,City Center (km),Metro Distance (km),Attraction Index,Normalised Attraction Index,Restraunt Index,Normalised Restraunt Index
Fake Host ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
22969,Amsterdam,194.033698,Weekday,Private room,False,True,2,False,1,0,10,93,1,5.022964,2.53938,78.690379,4.166708,98.253896,6.846473
13957,Amsterdam,344.245776,Weekday,Private room,False,True,4,False,0,0,8,85,1,0.488389,0.239404,631.176378,33.421209,837.280757,58.342928
4918,Amsterdam,264.101422,Weekday,Private room,False,True,2,False,0,1,9,87,1,5.748312,3.651621,75.275877,3.985908,95.386955,6.6467
20329,Amsterdam,433.529398,Weekday,Private room,False,True,4,False,0,1,9,90,2,0.384862,0.439876,493.272534,26.119108,875.033098,60.973565
38906,Amsterdam,485.552926,Weekday,Private room,False,True,2,True,0,0,10,98,1,0.544738,0.318693,552.830324,29.272733,815.30574,56.811677
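The `header` argument from the list above is not shown in the cells, so here is a minimal sketch of it alongside `index_col`. The CSV text is made up and held in a string (via `io.StringIO`) so the example needs no file on disk:

```python
import io
import pandas as pd

# a tiny hypothetical CSV: one header line, two data lines
csv_text = "id,city,price\n101,Vienna,120.5\n102,Amsterdam,194.0\n"

# header=0 (the default): the first line holds the column names
df_named = pd.read_csv(io.StringIO(csv_text), index_col='id')

# header=None: every line is treated as data and columns are numbered 0, 1, 2
df_raw = pd.read_csv(io.StringIO(csv_text), header=None)

print(df_named)
print(df_raw)
```

`header=None` is mainly useful for files that have no header line at all; on a file like this one it turns the column names into an ordinary first row.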


In [81]:
# we've been looking at the first few observations (head)
# or we could look at the last few (tail)
df_EUABnB.tail()

Unnamed: 0_level_0,City,Price,Day,Room Type,Shared Room,Private Room,Person Capacity,Superhost,Multiple Rooms,Business,Cleanliness Rating,Guest Satisfaction,Bedrooms,City Center (km),Metro Distance (km),Attraction Index,Normalised Attraction Index,Restraunt Index,Normalised Restraunt Index
Fake Host ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
35231,Vienna,715.938574,Weekend,Entire home/apt,False,False,6,False,0,1,10,100,3,0.530181,0.135447,219.402478,15.712158,438.756874,10.604584
25448,Vienna,304.79396,Weekend,Entire home/apt,False,False,2,False,0,0,8,86,1,0.810205,0.100839,204.970121,14.678608,342.182813,8.270427
25353,Vienna,637.168969,Weekend,Entire home/apt,False,False,2,False,0,0,10,93,1,0.994051,0.202539,169.073402,12.107921,282.296424,6.822996
26349,Vienna,301.054157,Weekend,Private room,False,True,2,False,0,0,10,87,1,3.0441,0.287435,109.236574,7.822803,158.563398,3.832416
33585,Vienna,133.230489,Weekend,Private room,False,True,4,True,1,0,10,93,1,1.263932,0.480903,150.450381,10.774264,225.247293,5.44414


In [82]:
# look at stats for the different groups
day_ABnB = df_EUABnB.groupby("Day")
day_ABnB.describe()

,Price,Price,Price,Price,Price,Price,Price,Price,Person Capacity,Person Capacity,...,Restraunt Index,Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index,Normalised Restraunt Index
Day,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Weekday,20886.0,257.004009,307.186451,37.129295,140.513732,199.812383,289.86858,18545.450285,20886.0,3.251317,...,856.359033,4592.883342,20886.0,26.199066,18.259998,0.66701,12.225004,21.791623,37.123893,100.0
Weekend,20828.0,263.193442,248.422795,34.779339,146.810507,207.820352,301.305054,13656.358834,20828.0,3.223113,...,864.709747,6696.156772,20828.0,24.906316,18.685124,0.592757,8.867334,21.845774,36.47094,100.0


In [83]:
# too much info there; what about Cleanliness Rating specifically?
day_ABnB['Cleanliness Rating'].describe()

Day,count,mean,std,min,25%,50%,75%,max
Weekday,20886.0,9.441348,0.888348,2.0,9.0,10.0,10.0,10.0
Weekend,20828.0,9.443201,0.89002,2.0,9.0,10.0,10.0,10.0
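
If `describe()` is still more than you need, one option is to ask the groups for only a few named statistics with `.agg()`. The sketch below uses a tiny made-up DataFrame (`df_toy` and its values are assumptions for illustration, not the real Airbnb data) so it runs on its own:

```python
import pandas as pd

# toy data (made-up values, NOT the real Airbnb data set)
df_toy = pd.DataFrame({
    "Day": ["Weekday", "Weekday", "Weekend", "Weekend"],
    "Price": [100.0, 200.0, 150.0, 250.0],
    "Cleanliness Rating": [9, 10, 8, 10],
})

# pick one column from the groups, then ask for only the stats we care about
clean_stats = df_toy.groupby("Day")["Cleanliness Rating"].agg(["mean", "min", "max"])
print(clean_stats)
```

The result has one row per group (`Weekday`, `Weekend`) and one column per requested statistic, so it is easier to scan than the full `describe()` table.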


## Saving a DataFrame as a csv
- .to_csv()
- index=False
- header=False
- appending to csv (mode='a', header=False)

In [84]:
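# index=False: the index is not written as the first column of the csv
# (here the index is Fake Host ID, so that column is left out of the file)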
df_EUABnB.to_csv('EU_ABnB_copy.csv', index=False)

In [85]:
# header=False: the column names are not written as the first row of the csv
# (you usually don't want to do this)
df_EUABnB.to_csv('EU_ABnB_copy2.csv', header=False)
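
To see what `index=False` and `header=False` actually change, it can help to look at the csv text itself. This is a small sketch on a made-up two-row DataFrame (`df_toy` and its values are assumptions for illustration); it relies on the fact that `to_csv()` returns the csv as a string when no file name is given, so nothing is written to disk:

```python
import pandas as pd

# toy data (made-up values); the index plays the role of "Fake Host ID"
df_toy = pd.DataFrame(
    {"City": ["Amsterdam", "Vienna"], "Price": [194.03, 133.23]},
    index=pd.Index([22969, 33585], name="Fake Host ID"),
)

# with no file name, to_csv returns the csv text instead of writing a file
csv_default = df_toy.to_csv()                # index and header both kept
csv_no_index = df_toy.to_csv(index=False)    # index column dropped
csv_no_header = df_toy.to_csv(header=False)  # column-name row dropped

print(csv_default)
print(csv_no_index)
print(csv_no_header)
```

Comparing the three printouts, the first line of the default version carries the column names (including the index name), the `index=False` version starts at `City`, and the `header=False` version starts straight at the first data row.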

In [86]:
# WHY would you want to not save the header?
# you could append to an existing csv with mode='a'
# this stacks 10 more copies of the same data set under the first one (sort of silly)
# but if you are creating a data set from scratch, or merging multiple files, this may be helpful

# write the file once with a header, then append without one;
# index=False on every write keeps the columns lined up
df_EUABnB.to_csv('EU_ABnB_copy3.csv', index=False)
for _ in range(10):
    df_EUABnB.to_csv('EU_ABnB_copy3.csv', index=False, header=False, mode='a')

## What's next?

The example above seems a bit contrived, but imagine you have a web-scraping job that visits a web page every hour and scrapes it for new data. You could add the new data as a few new rows to your existing dataset with the syntax shown above.
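
A minimal sketch of that idea, under some assumptions: `append_rows` is a hypothetical helper (not something from pandas or from this lecture), the two small DataFrames stand in for two hourly scrapes with made-up values, and a temporary folder is used so no files are left behind. The helper writes the header only when the file does not exist yet, so the csv ends up with exactly one header row no matter how many times you append.

```python
import os
import tempfile
import pandas as pd

def append_rows(df_new, path):
    """Append df_new to the csv at path, writing the header only once."""
    # hypothetical helper: the header is written only when the file
    # does not exist yet, so repeated calls keep a single header row
    write_header = not os.path.exists(path)
    df_new.to_csv(path, mode="a", header=write_header, index=False)

# stand-ins for two hourly scrapes (made-up values)
scrape_1 = pd.DataFrame({"City": ["Amsterdam"], "Price": [194.03]})
scrape_2 = pd.DataFrame({"City": ["Vienna", "Vienna"], "Price": [133.23, 301.05]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "scraped.csv")
    append_rows(scrape_1, csv_path)  # creates the file, writes the header
    append_rows(scrape_2, csv_path)  # appends rows only, no second header
    df_all = pd.read_csv(csv_path)

print(df_all)
```

Every append uses the same `index=False` setting, which keeps the number of columns identical across writes.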

Next Topics:
- Getting Data from the internet (APIs and Web Scraping)
- Data Curation, EDA and Data Visualization
- Intro to Machine Learning, and ML Ethics