 # Data Visualization and Pandas Bootcamp
 By: Adrian Garcia<br>
 UCSC: AM-170B

## Data Structures

- A container for holding several data objects. Data structures allow you to handle large amounts of data and keep everything organized.

- Some standard examples include:
    - Lists
    - Sets
    - Tuples
    - Dictionaries

- Data Frames
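
- As a quick illustration of the built-in containers listed above, here is a minimal sketch. The variable names and values are made up for this example.

```python
# Four built-in Python containers (illustrative values only).
values_list = [3, 1, 3]          # list: ordered, mutable, allows duplicates
values_tuple = (3, 1, 3)         # tuple: ordered, immutable
values_set = {3, 1, 3}           # set: unordered, duplicates are stored once
values_dict = {'a': 3, 'b': 1}   # dictionary: maps keys to values

print(len(values_list))    # number of items in the list
print(len(values_set))     # number of distinct items in the set
print(values_dict['a'])    # value stored under key 'a'
```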

## What are Data Frames?
- A type of data structure that organizes data into a 2D table of rows and columns, often of varying data type.
- These rows and columns are typically named (e.g., rows = samples, columns = characteristics)
- Other common naming conventions include:
    - rows = items, columns = properties
    - rows = observations, columns = variables
    - etc...

## Example: Creating a Data Frame

In [1]:
# Import pandas package
import pandas as pd
# Create Data
data = pd.DataFrame({'a':[1, 2, 3],
                     'b':[1.0, 2.0, 3.0],
                     'c':['1', '2', '3']})
data

|  | a | b | c |
|---|---|---|---|
| 0 | 1 | 1.0 | 1 |
| 1 | 2 | 2.0 | 2 |
| 2 | 3 | 3.0 | 3 |


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      float64
 2   c       3 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes


## Breaking Down `.info()`
- `int64`: 64-bit integer values
- `object`: arbitrary Python objects, most often strings (column `c` holds the strings `'1'`, `'2'`, `'3'`, not numbers)
- `float64`: 64-bit floating-point values
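
- The dtype matters for arithmetic. The sketch below rebuilds the small Data Frame from the example above and converts column `c` with `.astype()`. The name `c_as_int` is introduced here only for illustration.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [1.0, 2.0, 3.0],
                     'c': ['1', '2', '3']})

# Column 'c' holds strings, so "+" concatenates instead of adding.
doubled_text = data['c'] + data['c']
print(doubled_text.tolist())

# Convert the strings to integers before doing arithmetic.
c_as_int = data['c'].astype('int64')
print(c_as_int.sum())
```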


In [3]:
data.dtypes

a      int64
b    float64
c     object
dtype: object

## Loading Data
- A much more common practice is loading an existing data set. To do so, we call the pandas function `pd.read_csv()`
<br>
<br>

<font color='red'>NOTE</font>: `pd.read_csv()` has quite a few optional arguments for data cleaning. For the full list of arguments, see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
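
- A minimal sketch of two such optional arguments, `usecols` and `na_values`. To keep it self-contained, the CSV text is a small made-up stand-in read from an in-memory buffer, not the `diamonds.csv` file, and the `'?'` placeholder is an assumption for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file (hypothetical data).
csv_text = io.StringIO(
    "carat,cut,price\n"
    "0.23,Ideal,326\n"
    "0.21,Premium,?\n"
    "0.23,Good,327\n"
)

sample = pd.read_csv(
    csv_text,
    usecols=['carat', 'price'],  # keep only these columns
    na_values=['?'],             # treat '?' as a missing value
)
print(sample)
```

- Because one `price` entry is now missing, pandas stores that column as floating point rather than integer.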

## Example: Loading Data

In [4]:
# Load Data
df = pd.read_csv('diamonds.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


# Data Manipulation in Pandas
- Once a data set is loaded in, we can do quite a few things to help us navigate through the data set.

## Introduction
- For example:

In [5]:
df.head() # prints the first 5 rows

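|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |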


- We may also show the first "X" rows:

In [6]:
df.head(10) # prints the first 10 rows

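|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
| 7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
| 8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
| 9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.0 | 4.05 | 2.39 |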


- We may also show the last "X" rows:

In [7]:
df.tail(10) # prints the last 10 rows

|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 53930 | 0.71 | Premium | E | SI1 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 |
| 53931 | 0.71 | Premium | F | SI1 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 |
| 53932 | 0.7 | Very Good | E | VS2 | 60.5 | 59.0 | 2757 | 5.71 | 5.76 | 3.47 |
| 53933 | 0.7 | Very Good | E | VS2 | 61.2 | 59.0 | 2757 | 5.69 | 5.72 | 3.49 |
| 53934 | 0.72 | Premium | D | SI1 | 62.7 | 59.0 | 2757 | 5.69 | 5.73 | 3.58 |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.5 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.7 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |


## Indexing Data

- To access a column, we may either use dictionary-like indexing

In [8]:
df['cut'] # prints the 'cut' column

0            Ideal
1          Premium
2             Good
3          Premium
4             Good
           ...    
53935        Ideal
53936         Good
53937    Very Good
53938      Premium
53939        Ideal
Name: cut, Length: 53940, dtype: object

- or by attribute

In [9]:
df.cut # prints the 'cut' column

0            Ideal
1          Premium
2             Good
3          Premium
4             Good
           ...    
53935        Ideal
53936         Good
53937    Very Good
53938      Premium
53939        Ideal
Name: cut, Length: 53940, dtype: object

- However, to access a row, we index its `loc` attribute.

In [10]:
df.loc[0] # prints the first row

carat       0.23
cut        Ideal
color          E
clarity      SI2
depth       61.5
table       55.0
price        326
x           3.95
y           3.98
z           2.43
Name: 0, dtype: object

- To access a cell, we combine both attributes

In [11]:
df.cut.loc[0] # prints the first cell of the 'cut' column

'Ideal'

- or use a combination of dictionary-like indexing and the `loc` attribute

In [12]:
df['cut'].loc[0] # prints the first cell of the 'cut' column

'Ideal'
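
- A cell can also be reached in a single indexing step, which avoids chaining two lookups. The sketch below uses a three-row stand-in for the diamonds data so that it runs on its own. The variable names are introduced here for illustration.

```python
import pandas as pd

# Small stand-in for the diamonds data (values from the first three rows).
df = pd.DataFrame({'carat': [0.23, 0.21, 0.23],
                   'cut': ['Ideal', 'Premium', 'Good'],
                   'price': [326, 326, 327]})

by_label = df.loc[0, 'cut']    # row label 0, column label 'cut'
by_position = df.iloc[0, 1]    # first row, second column (by position)
by_at = df.at[0, 'cut']        # label-based accessor for a single value
print(by_label, by_position, by_at)
```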

- To find the size of a Data Frame, we may write:

In [13]:
df.shape[0] # prints the number of rows

53940

In [14]:
df.shape[1] # prints the number of columns

10
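
- Since `.shape` is a tuple of `(rows, columns)`, both numbers can be unpacked at once. A minimal sketch on a small stand-in Data Frame, with `n_rows` and `n_cols` as illustrative names:

```python
import pandas as pd

# Small stand-in Data Frame with 3 rows and 2 columns.
df = pd.DataFrame({'carat': [0.23, 0.21, 0.23],
                   'price': [326, 326, 327]})

n_rows, n_cols = df.shape   # unpack (rows, columns)
print(n_rows, n_cols)
print(len(df))              # len() also gives the number of rows
print(df.size)              # total number of cells (rows * columns)
```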

## Manipulating Data
- Question: How do we transform/add/delete data?

## Working Example: Transforming Data

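- Let's start by changing one cell of the original Data Frame. The chained form used here (a column lookup followed by `.loc`) makes pandas emit the warning shown in the output, and on newer pandas versions with Copy-on-Write it no longer updates `df` at all: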

In [15]:
df.price.loc[1] = 0 # changes the 2nd cell of 'price'
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.price.loc[1] = 0 # changes the 2nd cell of 'price'


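|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 0 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |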
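
- The warning above comes from assigning through two chained lookups. A single `.loc[row, column]` assignment writes directly into the Data Frame and does not raise it. A minimal sketch on a small stand-in Data Frame:

```python
import pandas as pd

# Small stand-in for the diamonds data.
df = pd.DataFrame({'carat': [0.23, 0.21, 0.23],
                   'price': [326, 326, 327]})

# One indexing step: row label 1, column 'price'.
df.loc[1, 'price'] = 0
print(df)
```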


- If you do not want to manipulate the original Data Frame:

In [16]:
price = df.price.copy() # copies the 'price' column
price[1] = 10 # changes the second cell of 'price'
df.price.head() # show the first 5 rows of the 'price' column

0    326
1      0
2    327
3    334
4    335
Name: price, dtype: int64

In [17]:
price.head()

0    326
1     10
2    327
3    334
4    335
Name: price, dtype: int64

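- To change multiple cells at once, we index with a list of row labels. Here we also silence the warning with the `warnings` package; note that `filterwarnings('ignore')` hides every warning for the rest of the session, not just this one: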

In [18]:
# Import warnings package
import warnings
warnings.filterwarnings('ignore')
# Changes the 2nd, 3rd, and 4th cell of the 'price' column
df.price[[1,2,3]] = [100,100,100]
df.price.head()

0    326
1    100
2    100
3    100
4    335
Name: price, dtype: int64
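
- Several cells can also be changed without suppressing any warnings by passing the list of row labels to `.loc`. The sketch below uses a five-row stand-in Data Frame. The boolean condition on `carat` is an extra illustration and is not part of the example above.

```python
import pandas as pd

# Five-row stand-in for the diamonds data.
df = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.29, 0.31],
                   'price': [326, 326, 327, 334, 335]})

# Set the 2nd, 3rd, and 4th cells of 'price' in one step.
df.loc[[1, 2, 3], 'price'] = 100

# Rows can also be selected by a condition instead of by label.
df.loc[df['carat'] > 0.30, 'price'] = 0
print(df['price'].tolist())
```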

## Working Example: Adding Data

- To add a column:

In [19]:
df['price per carat'] = df.price/df.carat # add 'price per carat' column
df.head()

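|  | carat | cut | color | clarity | depth | table | price | x | y | z | price per carat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 1417.391304 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 100 | 3.89 | 3.84 | 2.31 | 476.190476 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 100 | 4.05 | 4.07 | 2.31 | 434.782609 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 100 | 4.2 | 4.23 | 2.63 | 344.827586 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 1080.645161 |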


<font color='red'>NOTE</font>: we cannot use attribute-style assignment to add a column. Doing so sets an ordinary Python attribute on the Data Frame object instead of creating a column:

In [20]:
df.year = 2023
df.head()

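|  | carat | cut | color | clarity | depth | table | price | x | y | z | price per carat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 1417.391304 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 100 | 3.89 | 3.84 | 2.31 | 476.190476 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 100 | 4.05 | 4.07 | 2.31 | 434.782609 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 100 | 4.2 | 4.23 | 2.63 | 344.827586 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 1080.645161 |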


In [21]:
df.year

2023
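
- To actually create a constant-valued column, the dictionary-like form works. A minimal sketch on a two-row stand-in Data Frame:

```python
import pandas as pd

# Two-row stand-in for the diamonds data.
df = pd.DataFrame({'carat': [0.23, 0.21],
                   'price': [326, 326]})

# Dictionary-like assignment creates a real column;
# the scalar 2023 is repeated for every row.
df['year'] = 2023
print('year' in df.columns)
print(df)
```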

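- To add a row, we may use `.append()`. This method was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below requires an older pandas version: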

In [22]:
newSample1 = {'carat':0.10,'cut':'Ideal','color':'J','clarity':'SI1','depth':60.0,'table':58.0,'price':100,'x':3.89,'y':3.98,'z':2.63,'price per carat':1000}
df = df.append(newSample1,ignore_index = True) # add a new row
df.tail()

|  | carat | cut | color | clarity | depth | table | price | x | y | z | price per carat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 | 3829.166667 |
| 53937 | 0.7 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 | 3938.571429 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 | 3205.813953 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 | 3676.0 |
| 53940 | 0.1 | Ideal | J | SI1 | 60.0 | 58.0 | 100 | 3.89 | 3.98 | 2.63 | 1000.0 |
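
- On recent pandas versions, where `DataFrame.append()` is no longer available, the same row can be added with `pd.concat()`. This is a minimal sketch on a two-row stand-in Data Frame with fewer columns than the diamonds data. The names `new_sample` and `new_row` are illustrative.

```python
import pandas as pd

# Two-row stand-in for the diamonds data.
df = pd.DataFrame({'carat': [0.23, 0.21],
                   'cut': ['Ideal', 'Premium'],
                   'price': [326, 326]})

new_sample = {'carat': 0.10, 'cut': 'Ideal', 'price': 100}

# Wrap the dictionary in a one-row Data Frame, then concatenate.
new_row = pd.DataFrame([new_sample])
df = pd.concat([df, new_row], ignore_index=True)  # renumber the index 0..n-1
print(df.tail())
```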


## Working Example: Deleting Data

- To delete rows or columns, we may use `.drop()` **which drops rows by default**:

In [23]:
df.drop(df.index[-1],inplace = True) # drop the last row
df.tail(3)

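|  | carat | cut | color | clarity | depth | table | price | x | y | z | price per carat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53937 | 0.7 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 | 3938.571429 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 | 3205.813953 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 | 3676.0 |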


In [24]:
df.drop(columns = ['price per carat'],inplace = True) # drop 'price per carat'
df.head(3)

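|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 100 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 100 | 4.05 | 4.07 | 2.31 |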
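
- Without `inplace = True`, `.drop()` leaves the original untouched and returns a new Data Frame. A minimal sketch on a three-row stand-in, with `trimmed` as an illustrative name:

```python
import pandas as pd

# Three-row stand-in for the diamonds data.
df = pd.DataFrame({'carat': [0.5, 0.25, 0.5],
                   'price': [100, 100, 200]})
df['price per carat'] = df['price'] / df['carat']

# Drop two rows by label, then one column by name.
# Each call returns a new Data Frame; df itself is unchanged.
trimmed = df.drop(index=[0, 2]).drop(columns=['price per carat'])
print(trimmed)
print(df.shape)
```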


## Operations

- As expected, we may use all arithmetic operations `(+,-,*,/,etc.)` on the columns of a Data Frame
- For example:

In [25]:
xyz1 = (df.x + df.y) * (df.z / (df.y - df.x))
xyz1.head(3)

0    642.330
1   -357.126
2    937.860
dtype: float64

In [26]:
xyz2 = (df.x.loc[0] + df.y.loc[0]) * (df.z.loc[0] / (df.y.loc[0] - df.x.loc[0]))
xyz2

642.3300000000042

- If some values are missing, ordinary arithmetic propagates `NaN` into the result. To treat a missing value as a number instead, we may use the methods `.add()`, `.sub()`, `.mul()`, `.div()`, etc. with the argument `fill_value` (here `fill_value = 0`)

- For example:

In [30]:
# Import numpy package
import numpy as np
df.y.loc[0] = np.nan # sets the first cell to be a NaN
df.y.head(3)

0     NaN
1    3.84
2    4.07
Name: y, dtype: float64

In [28]:
xyz3 = df.x + df.y
xyz3.head(3)

0     NaN
1    7.73
2    8.12
dtype: float64

In [29]:
xyz4 = df.x.add(df.y,fill_value = 0)
xyz4.head(3)

0    3.95
1    7.73
2    8.12
dtype: float64
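
- One detail worth knowing: `fill_value` only substitutes when exactly one of the two operands is missing. If both are missing, the result stays `NaN`. A minimal sketch with two made-up Series:

```python
import numpy as np
import pandas as pd

x = pd.Series([3.95, np.nan, 4.05])
y = pd.Series([np.nan, np.nan, 4.07])

# Row 0: only y is missing, so it is treated as 0.
# Row 1: both are missing, so the result remains NaN.
# Row 2: neither is missing, so this is ordinary addition.
total = x.add(y, fill_value=0)
print(total)
```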

## Sorting

## Missing Data

## Data Summary

## Merging/Joining `DataFrames`

## Writing Data to Files

# Data Visualization

## Uses for Data Visualization

## Plot Selection

## Single Variable

## Two Variables

## More Than Two Variables

## General Tips

## Statistical Analysis