# Pandas

Pandas is a Python library for working with data sets.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".

Common tasks:

- Find correlations between two or more columns
- Find average values
- Find min/max values
- Delete rows of bad or empty data (i.e., data cleaning)
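
A minimal sketch of the tasks listed above, run on a made-up table (the column names and values are invented for illustration only):

```python
import pandas as pd

# Made-up workout data; one "duration" value is missing on purpose
df = pd.DataFrame({
    "duration": [60, 45, None, 30],
    "calories": [400, 300, 350, 200],
})

clean = df.dropna()                                       # delete rows with empty cells
print(clean["calories"].mean())                           # average value
print(clean["calories"].min(), clean["calories"].max())   # min / max
print(clean["duration"].corr(clean["calories"]))          # correlation between two columns
```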

A Series is like a single column in a table

```
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

calories = {"day1": 420}
```
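
A Series can also be built from a dictionary, in which case the keys become the labels. A small sketch, assuming three made-up entries (the "day2" and "day3" keys are illustrative):

```python
import pandas as pd

# The dictionary's keys become the labels (index) of the Series
calories = {"day1": 420, "day2": 380, "day3": 390}
by_day = pd.Series(calories)

print(by_day)
print(by_day["day2"])   # look up a value by its label
```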

## DataFrames

If a series is a column, then a DataFrame is the whole table

```
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)
print(myvar)
```
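
To see the column/table relationship directly, the sketch below pulls one column back out of the same DataFrame and checks its type (the variable name col is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

col = df["calories"]        # selecting one column gives back a Series
print(type(col).__name__)
print(df.shape)             # (number of rows, number of columns)
```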

## Selecting Rows

```
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)
print(df.loc[0])
print(df.loc[[0,1]])
```
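
The two loc calls return different types, which is easy to miss from the printed output. A short sketch on the same data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]          # single label: the row comes back as a Series
two_rows = df.loc[[0, 1]]    # list of labels: the result is a DataFrame

print(type(one_row).__name__, type(two_rows).__name__)
```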

## Naming Indexes

```
# The index needs one label per row, so three rows need three names
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
```
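
Once rows are named, loc accepts the names instead of positions. A self-contained sketch, assuming the same three-row data as earlier (the "day3" label is an assumption added so every row has a name):

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

print(df.loc["day2"])               # whole row, looked up by name
print(df.loc["day2", "calories"])   # single cell: row name, column name
```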

## Read Data from Files (CSV)
When a DataFrame has many rows, printing it shows only the first 5 and last 5 rows.

Use to_string() to print the entire DataFrame

Pandas has built-in display settings that alter this behavior

pd.options.display.max_rows sets the number of rows above which print statements cut off records (60 by default)

JSON files work similarly to CSV files: use read_json() in place of read_csv()

Dictionaries can be loaded into Pandas directly
```
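import pandas as pd

# Print every row of a CSV file
df = pd.read_csv("data.csv")
print(df.to_string())

# Raise the row limit so that a plain print(df) is no longer truncated
pd.options.display.max_rows = 9999
print(df)

# JSON files are read the same way
df = pd.read_json("data.json")
print(df.to_string())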
```
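
The truncation behavior can be tried without any file on disk. The sketch below is an illustration only: it feeds invented CSV text through io.StringIO as a stand-in for data.csv, and uses pd.option_context to change max_rows temporarily rather than globally.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (contents are invented for illustration)
csv_text = "Duration,Calories\n" + "\n".join(f"{i},{i * 10}" for i in range(100))
df = pd.read_csv(io.StringIO(csv_text))

short = repr(df)        # the text print(df) would show: truncated view
full = df.to_string()   # every row

print(len(short.splitlines()), len(full.splitlines()))

# Raise the limit only inside this block; the old setting is restored afterwards
with pd.option_context("display.max_rows", 9999):
    untruncated = repr(df)

# Dictionaries load directly, no file needed
from_dict = pd.DataFrame({"Duration": [60, 45], "Calories": [600, 450]})
```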

## Heads or Tails

head(x) returns the first x rows; tail(x) returns the last x rows

If no x is passed, both default to 5 rows

```
import pandas as pd
df = pd.read_csv("data.csv")

print(df.head(10))
print(df.tail())
```
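
The same calls on a small in-memory table, so the row counts are easy to check (the 20-row column n is made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({"n": range(20)})   # 20 made-up rows

top = df.head(10)     # first 10 rows
bottom = df.tail()    # no argument: last 5 rows

print(len(top), len(bottom))
```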

## Info

```
# info() prints its summary itself and returns None, so no print() is needed
df.info()
```
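
info() reports the column names, dtypes, and non-null counts, which makes it a quick way to spot empty cells. A sketch on made-up data, capturing the report through the buf parameter instead of printing it straight away:

```python
import io
import pandas as pd

# Made-up data with one missing Calories value
df = pd.DataFrame({"Calories": [420.0, None, 390.0], "Duration": [50, 40, 45]})

# info() writes its report to stdout by default and returns None;
# passing buf= sends the text to a buffer instead
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```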

## Cleaning Empty Cells

Note: these methods only treat a cell as empty if it is NaN, None, or pd.NA.

Empty strings are not treated as missing, so they are skipped
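
One possible workaround for empty strings, sketched on invented data: convert them to NaN first so that dropna() can see them. The title and rating columns here are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"title": ["Dune", "", None], "rating": [4.2, 3.9, 4.0]})

# Only the None row is dropped; the "" row survives
print(len(df.dropna()))

# Turn empty strings into NaN first, then drop
cleaned = df.replace("", np.nan).dropna()
print(len(cleaned))
```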

Get a copy with no nulls:
```
df = pd.read_csv("data.csv")
new_df = df.dropna()
```

Alter original:
```
df = pd.read_csv("data.csv")
df.dropna(inplace = True)
```

Replace nulls with a value:
```
df = pd.read_csv("data.csv")
df.fillna(130, inplace = True)
```

Replace only for a certain column:
```
df = pd.read_csv("data.csv")
x = df["Calories"].mode()[0]  # mode() returns a Series; take the first value
df.fillna({"Calories": x}, inplace = True)
```
Note: mean() or median() would also work here
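
A side-by-side sketch of the three fill values on a small made-up column, to show that they can differ:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [400.0, None, 300.0, 300.0]})

mean_fill = df["Calories"].mean()       # average of the non-missing values
median_fill = df["Calories"].median()   # middle value
mode_fill = df["Calories"].mode()[0]    # most frequent value

# fillna returns a new DataFrame unless inplace=True is passed
filled = df.fillna({"Calories": mean_fill})
print(mean_fill, median_fill, mode_fill)
```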

## Cleaning Wrong Format
```
import pandas as pd
df = pd.read_csv("data.csv")
df["Date"] = pd.to_datetime(df["Date"], format="mixed")
print(df.to_string())

# Strip whitespace and standardize text
df["title"] = df["title"].str.strip()
df["author"] = df["author"].str.title()

# Clamp ratings to the valid range 0 to 5
df["average_rating"] = df["average_rating"].clip(0, 5)
```
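
Some values cannot be converted at all. One way to handle them, sketched here on invented rows, is to pass errors="coerce" so that unparseable dates become NaT and can then be dropped. The explicit format string and the sample values are assumptions for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-12-01", "not a date", "2020-12-03"],
    "title": ["  Dune ", "Emma", " Ulysses"],
    "average_rating": [4.2, 7.5, -1.0],
})

# errors="coerce" turns unparseable values into NaT instead of raising
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

df["title"] = df["title"].str.strip()
df["average_rating"] = df["average_rating"].clip(0, 5)

# Drop only the rows whose date could not be parsed
cleaned = df.dropna(subset=["Date"])
print(cleaned)
```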

## Fixing Wrong Data
Set a specific item:
``` df.loc[7, "Duration"] = 45 ```

Loop through the DataFrame and conditionally change value:
```
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
```
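
The loop works, but the same cap can be written without one. A sketch of two loop-free alternatives on made-up values: a boolean mask with loc, and clip(upper=...).

```python
import pandas as pd

raw = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Option 1: boolean mask, change every row above 120 in one assignment
df = raw.copy()
df.loc[df["Duration"] > 120, "Duration"] = 120

# Option 2: clip() caps the values and returns a new Series
clipped = raw["Duration"].clip(upper=120)

print(df["Duration"].tolist(), clipped.tolist())
```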

## Removing Duplicates
Get a boolean per row: True for any row that duplicates an earlier one
```
df = pd.read_csv("data.csv")
print(df.duplicated())
```

Drop any duplicates
``` df.drop_duplicates(inplace = True) ```
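
Both calls on a tiny made-up table where the third row repeats the first:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [400, 300, 400, 200],
})

flags = df.duplicated()          # True where a row repeats an earlier row
deduped = df.drop_duplicates()   # returns a copy; the first occurrence is kept

print(flags.tolist())
print(deduped)
```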

## Data Correlations
```
import pandas as pd
import numpy as np

df = pd.read_json("books.json")
# Keep only the numeric columns
numeric_df = df.select_dtypes(include=[np.number])
# Drop columns with at most one distinct value; their correlation would be NaN
numeric_df = numeric_df.loc[:, numeric_df.nunique(dropna=True) > 1]
print(numeric_df.corr())
```

corr() only works with numeric columns.

Alternatively, let corr() skip the non-numeric columns itself:
```
df = pd.read_json("books.json")
print(df.corr(numeric_only=True))
```
- 1 = perfect positive correlation (the columns move together)
- -1 = perfect negative correlation (one rises as the other falls)
- 0 = no linear correlation
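
The two extremes can be reproduced with a small made-up table (column names and values are invented for this sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4],
    "distance": [10, 20, 30, 40],    # rises in step with hours
    "remaining": [40, 30, 20, 10],   # falls as hours rise
    "label": ["a", "b", "c", "d"],   # non-numeric, left out by numeric_only=True
})

corr = df.corr(numeric_only=True)
print(corr)
```

In the result, hours vs. distance sits at the positive end of the scale and hours vs. remaining at the negative end.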




