<center> 

# What is Data Science?

## Amanda Kube
## CIS 299 - Spring 2023 
    
</center>

<center> 
<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/UChicago_DSI.png" alt="UC-DSI" width="600" height="700">
</center>

<center>
<img src="https://images.footballfanatics.com/harry-s-truman-college-falcons/harry-s-truman-college-falcons-20-x-20-retro-logo-circle-sign_pi5110000_ff_5110552-aba43bc89b43218bc84f_full.jpg?_hv=2" alt="TRUMAN" width="200" height="200">
</center>

## Today

* Lists
* Arrays (NumPy)
* DataFrames (Pandas)

Python and its libraries provide many ways to group data together.   

Some important ones (listed in order of increasing functionality and sophistication):
* Lists (built into Python)
* Arrays (found in the NumPy library)
* DataFrames (a table, found in the pandas library)

In general you should use the simplest one that meets your needs...

We will use NumPy and pandas often, so let's install them.

Run the following in your command line/terminal:

`pip install numpy`

`pip install pandas`

(You may need to use `pip3` instead of `pip`.)

To use these in Jupyter, we need to import them:

In [2]:
import numpy as np
import pandas as pd

### Lists

<img style="float: right;" src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/python.png" alt="Python Logo" width="200" height="200">

A `list` is an ordered sequence of values.  

Create with square brackets `[]` or `list()` function:

In [3]:
list1 = [1, 2, 3, 4]
list1

[1, 2, 3, 4]

In [8]:
list2 = list("abcdef")
list2

['a', 'b', 'c', 'd', 'e', 'f']

An important (interesting) thing about lists is that values can be of different types:

In [10]:
list3 = [2+3, "four", [5, 6]]
list3

[5, 'four', [5, 6]]

In [11]:
list4 = list1 + list2 + list3
list4

[1, 2, 3, 4, 'a', 'b', 'c', 'd', 'e', 'f', 5, 'four', [5, 6]]

### Other Python Collections

<img style="float: right;" src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/python.png" alt="Python Logo" width="200" height="200">

* Tuples - Similar to lists but cannot be changed
* Sets – unordered set of unique values (like a set in math)
* Dictionaries – used to store <key, value> pairs 
  * keys must be unique (handy for look ups)

We will (mostly) not use these
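
For reference, here is a minimal sketch of the three collections listed above. The variable names and values are made up for illustration.

```python
# A tuple looks like a list but cannot be modified after creation.
point = (3, 4)
try:
    point[0] = 10              # tuples do not support item assignment
except TypeError as err:
    tuple_error = str(err)     # keep the error message instead of crashing

# A set keeps only unique values and has no guaranteed order.
unique_letters = set("banana")

# A dictionary maps unique keys to values, which makes look-ups easy.
gpa = {"Ada": 3.9, "Grace": 3.7}
gpa["Alan"] = 3.8              # add a new key/value pair

print(point)
print(unique_letters)
print(gpa["Grace"])
```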

Let's create a list of fruit - use square brackets

In [12]:
fruits = ['apple', 'banana','cherry','durian']
fruits

['apple', 'banana', 'cherry', 'durian']

You can select an item from a list by its index (indexing starts at 0)

In [13]:
fruits[1]

'banana'

You can also extract a "slice" of a list. A slice runs from the start index up to, but not including, the end index.

In [14]:
fruits[1:3]

['banana', 'cherry']

Slicing for lists (and 1-D arrays) is `[start: end: step]`, with defaults of 0, the length of the list, and 1

In [15]:
fruits[::2]

['apple', 'cherry']
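
A few more slicing patterns, sketched on a made-up list of letters. Negative indices and negative steps are standard Python list behavior, though they were not shown above.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

first_three = letters[:3]       # start defaults to 0
from_index_2 = letters[2:]      # end defaults to the length of the list
every_other = letters[1:6:2]    # indices 1, 3 and 5
last_item = letters[-1]         # negative indices count from the end
reversed_copy = letters[::-1]   # a step of -1 walks through the list backwards

print(first_three)
print(from_index_2)
print(every_other)
print(last_item)
print(reversed_copy)
```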

You can change the size of a list

In [16]:
fruits.append(5)
fruits

['apple', 'banana', 'cherry', 'durian', 5]

You can also choose the position at which to insert a new item


In [17]:
fruits.insert(4,'tomato')
fruits

['apple', 'banana', 'cherry', 'durian', 'tomato', 5]

### Arrays

<img style="float: right;" src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/NumPy.png" alt="NumPy Logo" width="200" height="200">

`Arrays` also contain a sequence of values

All elements of an `array` must have the *same type*
* Enables a more efficient implementation than `lists`



<img style="float: right;" src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/NumPy.png" alt="NumPy Logo" width="200" height="200">

Arithmetic is applied to each element individually

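When two `arrays` are added, they generally must have the same size; corresponding elements are added in the result. (NumPy's broadcasting rules allow some exceptions, such as adding a single number to every element of an array.)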

Arrays “form the core of nearly the entire ecosystem of data science tools in Python”

<font color='red'>Arrays are not built into Python; we get them from the NumPy library</font>
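
A small sketch of element-wise arithmetic, using made-up prices and quantities. It also shows how the same `*` operator behaves differently on a plain list.

```python
import numpy as np

prices = np.array([1.0, 2.5, 4.0])      # made-up values
quantities = np.array([3, 2, 1])

doubled = prices * 2                    # every element is multiplied by 2
totals = prices * quantities            # corresponding elements are multiplied
list_doubled = [1.0, 2.5, 4.0] * 2      # a list is repeated, not multiplied

print(doubled)
print(totals)
print(list_doubled)
print(prices.dtype)                     # one type shared by every element
```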

### Ranges (in numpy arrays)

<img style="float: right;" src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/NumPy.png" alt="NumPy Logo" width="200" height="200">

A range is an array of consecutive numbers:

`np.arange(end)`:  An array of increasing integers from 0 up to `end`

`np.arange(start, end)`: An array of increasing integers from `start` up to `end`

`np.arange(start, end, step)`: A range with `step` between consecutive values

The range always includes `start` but excludes `end`


### Lists vs Arrays

* Lists are more flexible:
  * Re-sizeable and can contain elements of different types.


* NumPy arrays have some other advantages:
  * Size – They take up less computer memory than lists
  * Performance – They are faster to access than lists
  * Functionality – SciPy and NumPy have optimized functions, such as linear algebra operations, built in.

The advantages of arrays become more noticeable as the number of values you are storing in them increases
* “Big Data”
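
The memory claim can be checked roughly with the sketch below. It assumes 64-bit integers for the array, and the list estimate is approximate because `sys.getsizeof` counts each integer object separately even when Python shares small integers internally.

```python
import sys
import numpy as np

n = 100_000
numbers_list = list(range(n))
numbers_array = np.arange(n, dtype=np.int64)

# A list stores references to separate Python int objects, so we count
# the list itself plus every object it refers to.
list_bytes = sys.getsizeof(numbers_list) + sum(sys.getsizeof(x) for x in numbers_list)

# An array stores the raw values side by side: 8 bytes per int64.
array_bytes = numbers_array.nbytes

print("list :", list_bytes, "bytes (approximate)")
print("array:", array_bytes, "bytes")
```

The exact list figure depends on the Python version, but it should come out several times larger than the array's.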

Now let's create an array by converting a Python list to a NumPy array


In [None]:
my_array = np.array([1,2,3,4])
print(my_array)

Two ways to get the sum of elements in the array

In [None]:
print(sum(my_array))
print(my_array.sum())

Add two arrays

In [None]:
my_array2 = np.array([5,6,7,8])
my_array + my_array2

Two ways to get the number of elements in the array

In [None]:
print(my_array.size)
print(len(my_array))

Arrays require items to all be of the same type.  Lists allow different types.


In [None]:
my_other_list = [1,2,3,4,'banana']
my_other_list

If you mix a float and a string in one array, NumPy converts the float to a string so that all elements share one type

In [None]:
my_other_array = np.array([1.0,'this is kinda confusing'])
print(my_other_array)
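
To see what NumPy did, you can inspect the array's `dtype`. This is a minimal sketch with made-up values showing two common conversions.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.5])        # the ints are promoted to floats
floats_and_text = np.array([1.0, 'banana'])    # the float becomes the string '1.0'

print(ints_and_floats.dtype)    # a floating-point type
print(floats_and_text.dtype)    # a Unicode string type
print(repr(floats_and_text[0]))
```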

There are many functions that can be applied to arrays

In [None]:
np.count_nonzero(np.array([1,2,0,2,1,0,2]))

Ranges can be very handy for creating data. The function `np.arange()` creates a half-open interval \[start, end): the end value is not included

In [None]:
print(np.arange(4,10))

If you leave out the start, the default is zero.
If you leave out the step, the default is one.

In [None]:
print(np.arange(10))

In [None]:
print(np.arange(1,31,2))

## DataFrames

* Rows and columns of data
  * Rows are the “individuals” or “instances”
  * Columns are attributes of those individuals

For example – a student dataframe with name, StudentID, GPA

Similar ideas exist in systems such as R or as “tables” in SQL databases
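
As a sketch, the student example could look like this in pandas. The names, IDs and GPAs are invented for illustration.

```python
import pandas as pd

# Each row is one student; each column is one attribute of that student.
students = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "StudentID": [1001, 1002, 1003],
    "GPA": [3.9, 3.7, 3.8],
})

print(students)
print(students.shape)   # (number of rows, number of columns)
```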


## Data From The Titanic

I just watched the movie *Titanic* and am wondering how close the depiction is to the real tragedy.

<center>

<img src="https://media1.popsugar-assets.com/files/thumbor/7CwCuGAKxTrQ4wPyOBpKjSsd1JI/fit-in/2048xorig/filters:format_auto-!!-:strip_icc-!!-/2017/04/19/743/n/41542884/5429b59c8e78fbc4_MCDTITA_FE014_H_1_.JPG" alt="Titanic Film">
   
</center>


<center>

<img src="https://upload.wikimedia.org/wikipedia/commons/6/6e/St%C3%B6wer_Titanic.jpg" alt="Titanic">
   
</center>

## Our Data - A DataFrame!

Pandas assigns an “index” for each row


In [1]:
import pandas as pd
titanic = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Ways to create a DataFrame

`pd.DataFrame()` with no arguments creates an empty DataFrame. Pass it rows of data, an `index`, and `columns` to create a populated one:

In [15]:
pd.DataFrame(
    [[4,5,6],[7,8,9],[10,11,12]],
    index = [1,2,3],
    columns=["a","b","c"])

| | a | b | c |
|---|---|---|---|
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
| 3 | 10 | 11 | 12 |


An alternative syntax builds the DataFrame column by column from a dictionary:

In [16]:
pd.DataFrame(
    {"a": [4,5,6],
     "b": [7,8,9],
     "c": [10,11,12]},
    index = [1,2,3])

| | a | b | c |
|---|---|---|---|
| 1 | 4 | 7 | 10 |
| 2 | 5 | 8 | 11 |
| 3 | 6 | 9 | 12 |


If you leave out the row index, pandas creates one by default, starting at 0:

In [17]:
df2 = pd.DataFrame(
    {"a": [4,5,6],
     "b": [7,8,9],
     "c": [10,11,12]})
df2

| | a | b | c |
|---|---|---|---|
| 0 | 4 | 7 | 10 |
| 1 | 5 | 8 | 11 |
| 2 | 6 | 9 | 12 |


## DataFrames consist of Series

In pandas, a “Series” is a 1-D array of indexed data

It is similar to a 1-D NumPy array but allows the index to be explicitly defined with the data (and index values don’t need to be integers)

A DataFrame is a sequence of aligned Series objects, where “aligned” means that they share the same index
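
A minimal sketch of these ideas, using invented student data: two Series with the same string index are combined into a DataFrame, and pulling a column back out gives a Series again.

```python
import pandas as pd

# Two Series that share the same (non-integer) index labels.
labels = ["ada", "grace", "alan"]
gpa = pd.Series([3.9, 3.7, 3.8], index=labels, name="GPA")
year = pd.Series([2, 1, 3], index=labels, name="year")

print(gpa["grace"])                    # look up a value by its label

# Combine the aligned Series into one DataFrame.
students = pd.DataFrame({"GPA": gpa, "year": year})
gpa_column = students["GPA"]           # a single column is a Series

print(students)
print(type(gpa_column))
print(students.index.equals(gpa_column.index))   # the index is shared
```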


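`pd.read_csv(filename)` reads a table from a CSV (comma-separated values) file, a plain-text format that spreadsheet programs can export. The filename can also be a URL.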


In [20]:
#import pandas as pd
titanic = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


What columns are there in the table?

In [21]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Maybe we'd prefer a list?

In [22]:
titanic.columns.tolist()

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

How many rows?

In [None]:
len(titanic)

Or we could ask for the number of rows and columns together

In [None]:
titanic.shape

Let's look at a column

In [None]:
titanic.Fare

What was the average price of a ticket?

In [None]:
titanic.Fare.mean()

Alternatively, we can look at a column this way

In [None]:
titanic['Fare']

And take the average

In [None]:
np.mean(titanic['Fare'])

This notation is handy if you want to extract multiple columns

In [None]:
titanic[['Pclass','Fare']]
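
The number of brackets matters. The sketch below uses a tiny stand-in table with made-up values (not the real Titanic file) to show that single brackets return a Series while double brackets return a DataFrame.

```python
import pandas as pd

# A small stand-in table with made-up values.
df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
})

one_column = df["Fare"]               # single brackets: a Series
one_column_table = df[["Fare"]]       # double brackets: a one-column DataFrame
two_columns = df[["Pclass", "Fare"]]  # double brackets: a two-column DataFrame

print(type(one_column))
print(type(one_column_table))
print(two_columns.shape)
```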

Note that none of the above has altered our original titanic DataFrame

In [2]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


If you want to keep the result with fewer columns you need to assign it to a variable

In [None]:
titanic_small = titanic[['Pclass','Fare']]
titanic_small

In the above, we chose columns. We can also choose rows by slicing

In [3]:
titanic[:3]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


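Choosing rows based on values can be done with `loc`. First build a True/False condition for every row, then pass it to `loc` to keep only the rows where it is True.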

In [None]:
titanic.Pclass == 3

In [None]:
titanic.loc[titanic.Pclass == 3]

You can do row selection with multiple conditions. Wrap each condition in parentheses and combine them with `&`.

In [None]:
titanic.loc[(titanic.Pclass == 3) & (titanic.Survived == 1)]

You can also use this alternative syntax

In [None]:
titanic.loc[titanic['Pclass'] == 3]
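
Conditions can be combined in other ways too. This sketch uses a tiny made-up table (not the real Titanic data) to show `&` (and), `|` (or) and `~` (not) on boolean conditions.

```python
import pandas as pd

# A small stand-in table with made-up values.
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 2],
    "Survived": [0, 1, 1, 0],
})

third_class = df["Pclass"] == 3       # one True/False value per row
survived = df["Survived"] == 1

both = df.loc[third_class & survived]     # third class AND survived
either = df.loc[third_class | survived]   # third class OR survived
not_third = df.loc[~third_class]          # NOT third class

print(third_class.tolist())
print(both)
print(either)
print(not_third)
```

The parentheses matter when conditions are written inline, because `&` and `|` bind more tightly than `==`.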

We can sort rows by values

Default sort is ascending

In [None]:
titanic.sort_values(by='Fare')

We can override that default if we like

In [None]:
titanic.sort_values(by='Fare', ascending=False)

You don't have to sort by numeric values...

In [None]:
titanic.sort_values(by='Name', ascending=False)
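
`sort_values` also accepts a list of columns, and it returns a new DataFrame rather than changing the original. A minimal sketch with made-up rows:

```python
import pandas as pd

# A small stand-in table with made-up values.
df = pd.DataFrame({
    "Name": ["Cora", "Abel", "Bea", "Dov"],
    "Pclass": [3, 1, 3, 1],
    "Fare": [8.05, 53.10, 7.25, 71.28],
})

# Sort by class (ascending), then by fare within each class (descending).
by_class_then_fare = df.sort_values(by=["Pclass", "Fare"], ascending=[True, False])

print(by_class_then_fare)
print(df["Name"].tolist())   # the original row order is unchanged
```

To keep the sorted version, assign the result to a variable, as done here.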

## Our repertoire: 

* Arithmetic Operations
* Comparisons
* Assignment Statements
* Call Expressions
* Arrays
* Lists
* DataFrames