# Data Structures!

Monday, June 12 2023

Notebook Author: Susanna Lange, PhD

## Goals: 

- Methods of storing data
    - Lists
    - Arrays
    - Dictionaries
    - DataFrames !
    
- Along the way
    - Built in Functions
    - Methods

## Grouping and Storing Data

Python and its libraries provide many ways to group data together. Some important ones:

- Lists, Tuples, Sets, Dictionaries (built-in to Python)

- Arrays (found in the NumPy library)

- DataFrames (in Pandas)

The above are listed in order of increasing functionality and sophistication
<br>
In general you should use the simplest one that meets your needs...



## Sequences (built into Python)


Sequences are the basic type of structure for grouping data in order. <br>

The most important are **lists**! (**Tuples** are also sequences; **sets** are a related built-in collection, but they are unordered.)




A **list** is an ordered sequence of values that can be changed, that is, a mutable object (built into Python, no library needed).



- Create a list with square brackets "[]"



### Let's create a list! 

In [None]:
#each element is separated by commas

School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]  
School_locations

#### Lists can have values of different types!


In [None]:
Random_stuff = ["California", 38, 3.14159] #A string, int, float
Random_stuff

### From Morning Session...Built-in Functions


Recall Python has built-in functions of the form ```function_name()```
A very common one is the ```print()``` function!

```type()``` is another important function. As you saw, we can call this on a number, string, list, or any object to see what type it is!


In [None]:
print("California")
print(type("California"))
print(type(38))
print(type("38"))
print(type(3.14))
print(type(Random_stuff))

## What can we do with lists?
 

 - Extraction and Slicing (extracting elements and sub-lists)

 
 - Methods and Built-in Functions

## Accessing values from a list


Each element of a sequence is assigned an index corresponding to its position, where **indices start at 0**. We can access an element by writing the name of the sequence or list followed by the index of the element we want in square brackets!


In [None]:
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]  

In [None]:
School_locations[1]

In [None]:
School_locations[0]

### <mark style="background-color: Thistle"> Code comprehension - Multiple Choice</mark>


There are six elements in our list ```School_locations```. What will happen when we run the following code? 


```python 
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."] 

School_locations[6]
```


Answer here

## Accessing values from a list


Negative indexes count from the end of the sequence


In [None]:
School_locations[-1]

This is useful when the list is long, or when we want the last or second-to-last element!
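A small sketch of how negative indices line up with positive ones, using the same list. The helper name `n` is introduced here only for illustration.

```python
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]

n = len(School_locations)  # 6 elements

# Index -1 is the last element, -2 is the second from the end, and so on
last = School_locations[-1]
second_from_end = School_locations[-2]

# A negative index -k refers to the same element as index n - k
same_element = School_locations[n - 2]

print(last)             # Washington D.C.
print(second_from_end)  # Georgia
print(same_element)     # Georgia
```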

### We can also extract a "slice" of a list


The range of elements can be specified with colons. The output is a list starting at the left index and stopping at (right index - 1). This is a half-closed interval \[start,end)



```python 
list[start: end]
```


In [None]:
School_locations[1:3]# up to but not including the end of the slice

The above returns the second and third elements of the list (the ones at index 1 and index 2)
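As a quick check of the half-closed interval rule, the sketch below stores a slice in a variable (the name `middle` is made up for this example) and looks at its length.

```python
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]

# Indices 1 and 2 are included; index 3 is not
middle = School_locations[1:3]
print(middle)       # ['Illinois', 'North Carolina']

# When both indices are in range, a slice has end - start elements
print(len(middle))  # 2

# Slicing builds a new list; the original still has all six elements
print(len(School_locations))  # 6
```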


### Default Settings: List slicing

We can slice a list by leaving out the starting (or ending) position. The default is to start at the beginning (or stop at the end, respectively).

In [None]:
School_locations[:3] #start at beginning stop at index end-1 -----> 3-1=2


### <mark style="background-color: Thistle"> Code comprehension - Multiple Choice</mark>
What will the following output?

```python
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]  

School_locations[:]
```



A. ```[]```



B. ```["California", "Washington D.C."]  ```




C. ```["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."] ```

### Default Settings: List slicing

It might be useful to extract every other element of a list (for example, the elements at even indices).


Suppose you have a list of time data, for example.
There is an optional argument we can use when slicing a list: the step.
In general, ```list[start: end: step]```, where step defaults to 1.

In [None]:
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]  
 
School_locations[::2]
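A few more step patterns, sketched on the same list. The variable names are chosen here for illustration only.

```python
School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]

# Start at index 0 and take every second element: indices 0, 2, 4
even_positions = School_locations[::2]

# Start at index 1 instead: indices 1, 3, 5
odd_positions = School_locations[1::2]

# A negative step walks through the list backwards
reversed_copy = School_locations[::-1]

print(even_positions)    # ['California', 'North Carolina', 'Georgia']
print(odd_positions)     # ['Illinois', 'Texas', 'Washington D.C.']
print(reversed_copy[0])  # Washington D.C.
```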

### Operations and manipulation on lists


 - We can insert items into lists!
         - either at the end
         - inserted in the middle
         
 - Count how many items with a specific value
 
 
 - Sort 
  

How do we do this? By using methods.
*Methods* are functions that belong to a particular type of object in Python and are called on that object. There are specific methods that work for all *list* objects! 



Methods take the form 
```python 
list.method()
```


Built-in functions can also be applied to objects in python. Recall they take the form 

```python 
function_name(list)
```
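To make the two calling styles concrete, here is a minimal sketch on a small made-up list (`numbers` is a hypothetical example, not part of the session data). It also shows counting and sorting, two of the operations listed above.

```python
numbers = [3, 1, 2, 3]  # hypothetical example list

# Method: called on the list with a dot
how_many_threes = numbers.count(3)

# Built-in functions: the list is passed in as an argument
length = len(numbers)
sorted_copy = sorted(numbers)  # returns a new sorted list

print(how_many_threes)  # 2
print(length)           # 4
print(sorted_copy)      # [1, 2, 3, 3]
print(numbers)          # [3, 1, 2, 3]  (unchanged so far)

# The .sort() method sorts the list itself and returns None
numbers.sort()
print(numbers)          # [1, 2, 3, 3]
```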


### Appending to a list

In [None]:
#you can append an item to the end of a list


School_locations = ["California", "Illinois", "North Carolina", "Texas", "Georgia", "Washington D.C."]  
 
School_locations.append('Michigan')
School_locations

In [None]:
#Or insert a value based on the index
School_locations.insert(4,'tomato')
School_locations

Note that we didn't reassign School_locations: these methods modify the list in place, so the change persists in later cells.
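One consequence worth seeing once: because the list is changed in place, every name that refers to the same list sees the change. The sketch below uses a short made-up list, and the names `alias` and `independent_copy` are illustrative.

```python
locations = ["California", "Illinois"]  # hypothetical short list

# append changes the list itself and returns None
result = locations.append("Texas")
print(locations)  # ['California', 'Illinois', 'Texas']
print(result)     # None

alias = locations                    # a second name for the SAME list
independent_copy = locations.copy()  # a separate list with the same values

locations.insert(1, "Georgia")
print(alias)             # ['California', 'Georgia', 'Illinois', 'Texas']
print(independent_copy)  # ['California', 'Illinois', 'Texas']
```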

### How do we find methods?

 - Use online documentation!


https://docs.python.org/3/tutorial/datastructures.html

    
 - Use built-in function ```dir()```   

In [None]:
print(dir(School_locations))

print(dir(list))

A common built-in function:

In [None]:
len(School_locations)

### <mark style="background-color: Thistle">Working with Lists: Short Activity!</mark>

Create a list with the people in your group! Each person contributes at least one list item. (Recall this can be a string, int, float)

With this list, perform the following:

In [None]:
#Define your list here


1. Remove elements 2 and 3

In [None]:
#Code here

2. Append your favorite number to the end of the list

In [None]:
#Code here

3. Insert the string "Math is fun" at index 2

In [None]:
#Code here

4. Did your above operations change your original list, or create a copy?

Answer here

5. (optional) Given list ```number_list = [1,2,3,4,5,6,7]```, create a new list ```new_list = [6,4,2]``` in 2 lines of code. You may have to use list methods.

In [None]:
#Code here

## Grouping Data using Arrays 
 (from the NumPy library)

Arrays contain a sequence of values


 - All elements of an array must have the same type
 
 
   - Why? Numpy arrays were built for efficient computation

 - Can perform operations on all elements in one step
 - When two arrays are added, they must have compatible shapes (in the simplest case, the same size) 
   - Corresponds to elementwise addition

### Lists vs Arrays

- Lists are more flexible
    - Can contain elements of different types
    
 
- NumPy arrays have some advantages
    - size - they take up less computer memory than lists
    - performance - faster access than lists
    - functionality - linear algebra functions built in
    - can be multiple dimensions
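One practical difference shows up with `+` and `*`. The sketch below uses small made-up values: on lists these operators join and repeat, while on arrays they do arithmetic on each element.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [10, 20, 30]
array_a = np.array(list_a)
array_b = np.array(list_b)

# Lists: + concatenates, * repeats
list_sum = list_a + list_b
list_doubled = list_a * 2

# Arrays: + and * work element by element
array_sum = array_a + array_b
array_doubled = array_a * 2

print(list_sum)       # [1, 2, 3, 10, 20, 30]
print(list_doubled)   # [1, 2, 3, 1, 2, 3]
print(array_sum)      # [11 22 33]
print(array_doubled)  # [2 4 6]
```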

### Creating an array


First import the NumPy library!!

```python 
import numpy as np

np.array([])
```

In [None]:
import numpy as np

np.array([1,2])

In [None]:
tomato_list = [22, 38, 26, 35, 35,'tomato']
print(tomato_list) 

In [None]:
tomato_array = np.array([22, 38, 26, 35, 35,'tomato'])
print(tomato_array)

A list can hold values of different types, but an array converts them all to one common type. Here, all the ints were converted to strings.


An array makes sure every element has the same type.
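The common type an array settled on can be inspected with its `dtype` attribute. A minimal sketch with made-up values follows; the exact integer and string type names printed can vary by platform, so no exact output is shown for those.

```python
import numpy as np

int_array = np.array([22, 38, 26])           # all ints
mixed_array = np.array([22, 38.5, 26])       # ints and a float
tomato_array = np.array([22, 38, 'tomato'])  # ints and a string

print(int_array.dtype)     # an integer type
print(mixed_array.dtype)   # float64: the ints were converted to floats
print(tomato_array.dtype)  # a Unicode string type

# The number 22 is now stored as the string '22'
print(tomato_array[0] == '22')  # True
```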

### Exploring arrays

In [None]:
#create an array - converting a python list to a numpy array
prime50_array = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47])

print(type(prime50_array))

What can we explore using arrays?

 - Extraction and Slicing (extracting elements and sub-arrays)
 
 - Attributes (characteristics of arrays: shape, ndim, size)
 
 - Methods and Built-in Functions

 - Manipulation of Arrays (Operations, reshaping, ...)

Extraction and slicing of one dimensional arrays work exactly the same as lists!

In [None]:
print(prime50_array[1]) #extracts second element

prime50_array[1:2] #starting at index 1 and up to (but excluding) index 2

### Arrays have attributes

Characteristics of the object!
(See https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)

- Size ---> .size

- Shape ---> .shape

In [None]:
prime50_array.size

In [None]:
prime50_array.shape
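Attributes are read without parentheses, unlike methods. The sketch below compares `size`, `shape`, and `ndim` before and after a reshape, using a made-up array of 12 numbers.

```python
import numpy as np

numbers = np.arange(12)  # the integers 0 through 11

print(numbers.size)   # 12    total number of elements
print(numbers.shape)  # (12,) one dimension of length 12
print(numbers.ndim)   # 1     number of dimensions

# Rearrange the same 12 values into 3 rows and 4 columns
grid = numbers.reshape(3, 4)

print(grid.size)   # 12
print(grid.shape)  # (3, 4)
print(grid.ndim)   # 2
```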

### Arrays have useful methods

 -  .sum()
    
 -  .reshape()

 -  .nonzero()
 
 
 (very incomplete list)
 
(see the documentation for a complete list: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)


In [None]:
print(prime50_array.sum())
#We cannot always sum a list because a list can have different data types!
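To illustrate the comment above, here is a small sketch with two made-up lists. Python's built-in `sum()` works on a list of numbers but raises a `TypeError` when the list also holds a string.

```python
numeric_list = [2, 3, 5]
mixed_list = [2, 3, "tomato"]

# Works: every element is a number
total = sum(numeric_list)
print(total)  # 10

# Fails: an int and a string cannot be added together
try:
    sum(mixed_list)
    error_raised = False
except TypeError:
    error_raised = True

print(error_raised)  # True
```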


## Arrays have many useful built-in functions!


Notice there is often more than one way to do a common operation!

In [None]:
print(prime50_array.sum()) #sum method

print(np.sum(prime50_array)) #sum built in function (from NumPy library)
      
print(np.count_nonzero(np.array([1,2,0,2,1,0,2])))

print(np.mean(prime50_array))

### We can easily create arrays by specifying a range.

Calling ```np.arange()``` creates a half-closed interval \[start,end) - the end value is not included



In [None]:
np.arange(4,10)

In [None]:
#if you leave out the start, the default is zero; 
print(np.arange(10))

We can specify a step size we want to increment by. If we leave out the step, the default is one


In [None]:
print(np.arange(1,31,2))
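The step can also be negative, in which case the start must be larger than the end. A short sketch follows; the names `odds` and `countdown` are illustrative.

```python
import numpy as np

# Odd numbers from 1 up to (but not including) 31: 1, 3, 5, ..., 29
odds = np.arange(1, 31, 2)

# Counting down from 10 to (but not including) 0 in steps of 2: 10, 8, 6, 4, 2
countdown = np.arange(10, 0, -2)

print(odds.size)      # 15
print(odds[-1])       # 29
print(countdown.size) # 5
```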

## Another reason why arrays are useful!

Elementwise operations!

In [None]:
array_1 = np.arange(10)
array_2 = np.array([1,2,3,4,5,6,7,8,9,10])
difference_array = array_1 - array_2
difference_array
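Elementwise operations also work between an array and a single number, and comparisons return an array of True/False values. A minimal sketch using made-up temperature values:

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])  # hypothetical temperatures

# The formula is applied to every element at once
fahrenheit = celsius * 9 / 5 + 32  # 32, 77 and 212 degrees

# Comparisons are elementwise too
above_freezing = celsius > 0

print(fahrenheit)
print(above_freezing)  # [False  True  True]
```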

## <mark style="background-color: Thistle"> Code comprehension: Multiple Choice </mark>

#### What will be printed?

```python 
import numpy as np
a = np.array([1,2,3,5,8])
b = np.array([0,3,4,2,1])
c = a + b
c = c*a
print (c[2])
```

A. 7

B. 12

C. 10

D. 21

E. 28

## <mark style="background-color: Thistle"> Code comprehension: Multiple Choice </mark>

What will be output for the following code? 

```python 
 number_array = np.array([1,2,3,5,8])
 number_array = number_array + 1
 print(number_array[1])
```

 A. 0
    
 B. 1

 C. 2
    
 D. 3

## <mark style="background-color: Thistle">Working with Arrays: Short Activity!</mark>

Use the following array to answer the questions


In [None]:
random_number_array = np.array([32, 56, 78, 3, 15, 109, 13, 24, 58, 61, 90, 93, 45, 21, 46])

1. Remove elements 2 and 3

In [None]:
#code here

2. Use a method to find the minimum value in the array

In [None]:
#code here

3. Find the 4th smallest element in the array

In [None]:
#code here

## Dictionaries

A **dictionary** is a set of "key: value" pairs

- keys must be unique

Create a dictionary with curly braces "{}" 

Entries of a dictionary are of the form "key: value" 



In [None]:
survey_dict = {0: "Strongly Disagree", 1: "Disagree", 2: "No opinion", 3: "Agree", 4: "Strongly Agree"}

Access elements of dictionary by key!

In [None]:
survey_dict[1]

Dictionaries are useful for storing values and looking them up by key!

A few useful operations with dictionaries:

 - Add an entry
 
 - Delete an entry

Add and delete pairs!

In [None]:
del survey_dict[1]

In [None]:
survey_dict

In [None]:
#this adds a key
survey_dict['new_key'] = 'new value'

In [None]:
survey_dict
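A few related operations, sketched on a shortened copy of the survey dictionary. The replacement label "Neutral" and the default text "Not recorded" are made up for this example.

```python
survey_dict = {0: "Strongly Disagree", 1: "Disagree", 2: "No opinion"}

# Assigning to a new key adds an entry
survey_dict[3] = "Agree"

# Assigning to an existing key overwrites its value (keys are unique)
survey_dict[2] = "Neutral"

# .pop() removes an entry and returns its value
removed = survey_dict.pop(1)
print(removed)  # Disagree

# .get() returns a default instead of raising an error for a missing key
missing = survey_dict.get(99, "Not recorded")
print(missing)  # Not recorded

print(survey_dict)  # {0: 'Strongly Disagree', 2: 'Neutral', 3: 'Agree'}
```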

Note we can also determine if keys are contained in the dictionary

In [None]:
3 in survey_dict

In [None]:
"Disagree" in survey_dict

In [None]:
'new_key' in survey_dict
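Note that `in` looks only at the keys. To check whether something appears among the values, one option is to test against `.values()`, as in this sketch on a fresh, shortened dictionary:

```python
survey_dict = {0: "Strongly Disagree", 1: "Disagree", 2: "No opinion"}

key_found = 1 in survey_dict                      # 1 is a key
value_as_key = "Disagree" in survey_dict          # "Disagree" is a value, not a key
value_found = "Disagree" in survey_dict.values()  # search the values instead

print(key_found)     # True
print(value_as_key)  # False
print(value_found)   # True
```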

Keep in mind that keys do not all need to be the same type, although it usually makes more sense to keep them that way.


In [None]:
survey_dict_2 = {"Strongly Disagree": 0 , "Disagree": 1 , 2: "No opinion", 3: "Agree", 4: "Strongly Agree"}

In [None]:
survey_dict_2["Disagree"]

In [None]:
list(survey_dict_2)

### Dictionary Methods

In [None]:
print(dir(survey_dict_2))

In [None]:
survey_dict.keys()

In [None]:
survey_dict
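Alongside `.keys()`, the methods `.values()` and `.items()` are often handy. A minimal sketch on a small made-up dictionary (`scores` is hypothetical):

```python
scores = {"Agree": 3, "Disagree": 1}  # hypothetical example

labels = list(scores.keys())    # just the keys
numbers = list(scores.values()) # just the values

print(labels)   # ['Agree', 'Disagree']
print(numbers)  # [3, 1]

# .items() gives key-value pairs, convenient in a loop
pairs = []
for label, score in scores.items():
    pairs.append((label, score))

print(pairs)  # [('Agree', 3), ('Disagree', 1)]
```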

## <mark style="background-color: Thistle">Working with Dictionaries: Short Activity!</mark>

Below is a dictionary containing total number of homicides in the United States in 2021, by state. (Published by Statista Research Department, Oct 14, 2022). Note Washington D.C. is included as 'District of Columbia'.

1. Pick a few states of interest and find their homicide number (use code here...do not just manually search the dictionary!)


2. Find the number of keys in the dictionary. Does this imply all 50 states are included here?


3. What are some limitations to this data?


In [None]:
homicide_dict = {'Texas': 2064, 'North Carolina': 928, 'Ohio': 824, 'Michigan': 747, 'Georgia': 728, 'Tennessee': 672, 'Missouri': 593, 'Virginia': 562, 'South Carolina': 548, 'Illinois': 514, 'Pennsylvania': 510, 'Louisiana': 447, 'Indiana': 438, 'Alabama': 370, 'Kentucky': 365, 'Colorado': 358, 'Washington': 325, 'Arkansas': 321, 'Wisconsin': 315, 'Oklahoma': 284, 'Nevada': 232, 'Minnesota': 203, 'Arizona': 190, 'Oregon': 188, 'New Mexico': 169, 'Mississippi': 149, 'Connecticut': 148, 'Maryland': 138, 'New Jersey': 137, 'Massachusetts': 132, 'New York': 124, 'California': 123, 'District of Columbia': 109, 'West Virginia': 95, 'Delaware': 94, 'Kansas': 87, 'Utah': 85, 'Iowa': 70, 'Rhode Island': 38, 'Idaho': 36, 'Montana': 31, 'South Dakota': 26, 'Nebraska': 25, 'Alaska': 18, 'Maine': 18, 'Wyoming': 17, 'New Hampshire': 14, 'North Dakota': 14, 'Vermont': 8, 'Hawaii': 6}

In [None]:
#code here

## Grouping Data with DataFrames 
(from the Pandas library)

### Different objects for different goals

Method 1: (Using Python lists, no NumPy import needed)
- list of lists
- Hard to manipulate


Method 2: (Using np.array)
- All values of same data type
- Easy to do math and matrix manipulations
- No row or column names


Method 3: (Using pandas DataFrames)
- Columns can have different types
- Easy to manipulate by name
- row and column names built in!

### DataFrames

- Gives rows and columns of data
- Rows are "individuals" or "instances"
- Columns are attributes of those individuals

### Storing Data in 3 ways

Suppose we take a survey of student ID, favorite number, and favorite food

In [None]:
list_of_lists=[['Pizza','Pierogi','Ramen'],  #fav food
               [0,22,-3.1415],  #fav number
               [1234, 4456, 5882]]   #Student ID

list_of_lists

In [None]:
#This is a 2D array
np.array([['Pizza','Pierogi','Ramen'],  #fav food
               [0,22,-3.1415],  #fav number
               [1234, 4456, 5882]]) #student id

In [None]:
#ignore this code for now; DataFrame creation will be discussed in more detail on Tuesday!
import pandas as pd

df = pd.DataFrame(
    { 1:['Pizza','Pierogi','Ramen'],  #fav food
      2:[0,22,-3.1415], #fav number
     3:[1234, 4456, 5882]}) #id

df
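As a rough preview of why named columns help, the sketch below builds the same survey with descriptive column names. The names `survey_df`, `fav_food`, `fav_number`, and `student_id` are chosen here for illustration, and this assumes pandas is installed.

```python
import pandas as pd

survey_df = pd.DataFrame({
    "fav_food": ["Pizza", "Pierogi", "Ramen"],
    "fav_number": [0, 22, -3.1415],
    "student_id": [1234, 4456, 5882],
})

# Each column keeps its own type, unlike the 2D array above
print(survey_df.dtypes)

# A column can be pulled out by name
print(survey_df["fav_food"])

# 3 rows (students) and 3 columns (attributes)
print(survey_df.shape)  # (3, 3)
```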

### Additional material


We often find that we want to format the output of our code in a nice way. This includes printing string and code output. In Python, there are multiple ways to do this. Two options are listed below. Consider our prime50_array for this exercise.

In [None]:
prime50_array 

In [None]:
#Option 1: print string and code separated by a comma
print("array:", prime50_array) 

print("max:", np.max(prime50_array)) #we can apply functions directly to code output


#Option 2: print string and code by converting everything to a string and concatenating
print("Maximum prime under 50 is " + str(np.max(prime50_array)))
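A third option, not covered above, is an f-string: put `f` before the opening quote and place code inside curly braces. The sketch below uses a short made-up list of primes rather than `prime50_array`.

```python
primes = [2, 3, 5, 7, 11]  # hypothetical short list

# Code inside {} is evaluated and inserted into the string
message = f"Smallest prime in the list is {min(primes)}"
print(message)  # Smallest prime in the list is 2

# A format such as .2f controls how many decimals are shown
average = sum(primes) / len(primes)
formatted = f"Average: {average:.2f}"
print(formatted)  # Average: 5.60
```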

Question: Modify the last print statement so that its output ends with a period.