# 1 - Pandas & Datasets

Pandas helps us manage datasets, which are very often stored in flat files (e.g. `csv`, `xlsx`, `tsv`). In this lesson, we're going to create our first dataset with random data.

### Create Random Data
Below is a short Python snippet that generates random data using only the standard library (no external dependencies).

In [5]:
import random

In [6]:
items = []

random_number = random.randint(0, 50_000)

def float_to_dollars(value):
    # in the future, this will be stored in
    # utils.py in the courses/ directory
    return f"${value:,.2f}"


for x in range(0, random_number):
    dollars = random.randint(30_000, 50_000_000)
    data = {
        "Player Name": f"Player-{x}",
        "Player Salary": float_to_dollars(dollars)
    }
    items.append(data)
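
As a quick check of the helper above: the format spec `,.2f` adds thousands separators and rounds to two decimal places. A minimal sketch, with sample values that are made up purely for illustration:

```python
def float_to_dollars(value):
    # same helper as above, repeated so this snippet runs on its own
    return f"${value:,.2f}"

# sample values are arbitrary, chosen only for illustration
small = float_to_dollars(30_000)
large = float_to_dollars(1234567.891)

print(small)  # $30,000.00
print(large)  # $1,234,567.89
```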

There are a few questions I want to ask about this data:
- How do we save this data? How do we load saved data?
- How do we clean this data?
- How do we analyze this data?

The answer, of course, is Pandas. So let's see why.

### Initialize a DataFrame
A table of data in Pandas is called a DataFrame. At its core, a DataFrame is just rows and columns. There are many ways to initialize one. Let's use the data from above to start our first one:

In [7]:
import pandas as pd

df = pd.DataFrame(items)

Pandas follows the same aliasing convention as [numpy](https://numpy.org/) (`import numpy as np`) when importing:
```python
import pandas as pd
```
So in Python projects that use Pandas, you will typically see this import somewhere. You usually won't do `import pandas` or `from pandas import DataFrame`. As with most things in software, there's nothing technically stopping you from doing that; it's just not the common practice.

The variable `df` is very often used for instances of `DataFrame`.

Since a `DataFrame` is a table with columns and rows, you can easily initialize it with a list of dictionaries: each dictionary becomes a row and each key becomes a column.
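
To make that concrete, here is a minimal sketch using a hand-written list of dictionaries. The `sample_items` name and its values are made up for illustration; they only mirror the shape of `items` above:

```python
import pandas as pd

# hypothetical sample data with the same keys as `items` above
sample_items = [
    {"Player Name": "Player-0", "Player Salary": "$30,000.00"},
    {"Player Name": "Player-1", "Player Salary": "$45,500.00"},
]

sample_df = pd.DataFrame(sample_items)

print(sample_df.shape)          # (rows, columns) -> (2, 2)
print(list(sample_df.columns))  # ['Player Name', 'Player Salary']
```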

Let's take a look at this data:

In [8]:
df.head()

|   | Player Name | Player Salary |
|---|---|---|
| 0 | Player-0 | $23,564,932.00 |
| 1 | Player-1 | $19,122,655.00 |
| 2 | Player-2 | $9,926,467.00 |
| 3 | Player-3 | $44,055,782.00 |
| 4 | Player-4 | $41,113,231.00 |


Tables in Pandas can be massive so we use `df.head()` to get a glimpse of the first 5 rows. Use `df.head(n=20)` to change this value. You can also use `df.tail(n=5)` to see the end of this table.
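
A small sketch of both methods on a throwaway DataFrame (the `numbers_df` name and its contents are assumptions for illustration only):

```python
import pandas as pd

# a throwaway 100-row table, used only to show head/tail
numbers_df = pd.DataFrame({"n": range(100)})

first_five = numbers_df.head()        # default is n=5
first_twenty = numbers_df.head(n=20)  # ask for more rows
last_three = numbers_df.tail(n=3)     # rows from the end of the table

print(len(first_five), len(first_twenty), len(last_three))  # 5 20 3
print(last_three["n"].tolist())                             # [97, 98, 99]
```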

### Exporting a DataFrame (Writing)
There are many ways to save DataFrames. You can save to:

- CSV (Comma Separated Values)
- TSV (Tab Separated Values)
- Excel (`xlsx`)
- JSON (JavaScript Object Notation)
- HDF (HDF5 files)
- HTML (reading/writing HTML tables `<table>`)
- Pickle
- SQL
- And much [more](https://pandas.pydata.org/docs/reference/io.html)

Throughout this course we'll use a mixture of storage options, but mostly `csv` files, as they are lightweight and easy to use in many situations.

So how do we save this?

In [12]:
df.to_csv("example.csv", index=False)
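
The `index=False` argument keeps the row index out of the file. The sketch below writes to in-memory buffers instead of real files (an assumption made so nothing is left on disk, and `demo_df` is a made-up example) to show the difference once the data is read back:

```python
import io
import pandas as pd

demo_df = pd.DataFrame({"Player Name": ["Player-0", "Player-1"]})

# with the index: the output gains an extra, unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# without the index: only the real columns are written
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Player Name']
print(cols_without_index)  # ['Player Name']
```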

Here are a few other ways to export:


```python
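# TSV
df.to_csv('example.tsv', sep='\t', index=False)

# Excel
df.to_excel('example.xlsx', sheet_name='example', index=False)

# JSON: orient='records' writes a list of row objects without the index
# (index=False is not accepted with the default orient)
df.to_json('example.json', orient='records')

# HDF
df.to_hdf('example.h5', key='example', index=False)

# HTML
df.to_html('example.html', index=False)

# Pickle: to_pickle has no index argument; the whole DataFrame is stored as-is
df.to_pickle('example.pkl')

# SQL: a file-based SQLite database (example file name `example.db`) so the
# table persists; an in-memory 'sqlite://' database is gone once the engine is
from sqlalchemy import create_engine
engine = create_engine('sqlite:///example.db', echo=False)
df.to_sql('example_table', con=engine, index=False, if_exists='replace')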
```

Now that we have saved our `example.csv` file, how do we load it in? That's just as simple: there is usually a matching `pd.read_<filetype>` function directly in Pandas.

> A quick note: there are many reasons these different file types exist. One of them, especially when dealing with `csv` files, has to do with data types. More on storing data types later.
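
As a small illustration of that note, the sketch below sends a datetime column through CSV and back. It uses an in-memory buffer rather than a real file, and the `dates_df` data is made up for the example:

```python
import io
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# hypothetical data: one column holding real datetime values
dates_df = pd.DataFrame({"joined": pd.to_datetime(["2020-01-01", "2021-06-15"])})

# round trip through CSV text
buffer = io.StringIO()
dates_df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# CSV stores plain text, so the datetime type is not restored automatically
before = is_datetime64_any_dtype(dates_df["joined"])
after = is_datetime64_any_dtype(from_csv["joined"])
print(before, after)  # True False

# asking read_csv to parse the column brings the type back
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["joined"])
print(is_datetime64_any_dtype(parsed["joined"]))  # True
```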

### Importing Data (Reading)

Importing data is just as easy as exporting it, but instead of calling a method on a `DataFrame` instance, we use the built-in `pd.read_*` functions. First, here are some examples:

```python
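# CSV
df = pd.read_csv('example.csv')

# TSV
df = pd.read_csv('example.tsv', sep='\t')

# Excel
df = pd.read_excel('example.xlsx', sheet_name='example')

# JSON
df = pd.read_json('example.json')

# HDF
df = pd.read_hdf('example.h5', key='example')

# HTML: read_html returns a list of DataFrames, one per <table> found
df = pd.read_html('example.html')[0]

# Pickle
df = pd.read_pickle('example.pkl')

# SQL: point at a file-based SQLite database that already holds the table
# (`example.db` is an example file name); a fresh in-memory 'sqlite://'
# engine would be empty
from sqlalchemy import create_engine
engine = create_engine('sqlite:///example.db')
df = pd.read_sql('SELECT * FROM example_table', con=engine)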
```


In [6]:
new_df = pd.read_csv('example.csv')
new_df.head()

|   | Player Name | Player Salary |
|---|---|---|
| 0 | Player-0 | $46,979,630.00 |
| 1 | Player-1 | $26,599,432.00 |
| 2 | Player-2 | $33,629,681.00 |
| 3 | Player-3 | $45,665,631.00 |
| 4 | Player-4 | $37,545,118.00 |


Now that we can export and import data, how do we clean it up? 

In [None]:
# Export to samples dir
# df.to_csv("samples/1.csv", index=False)