# Pandas Basics

## What is Pandas?

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

## Installing Pandas (PIP)

Pip is a package manager for Python that automates the process of installing, upgrading, configuring, and removing packages.

`How it works`
Pip uses PyPI (the Python Package Index) as its default repository for fetching packages, but it can also install packages from other sources, such as version control systems, requirements files, and distribution files.

`When it's included`
Pip is included by default in Python version 3.4 or later. If you installed Python from source, with an installer from python.org, or via Homebrew, you should already have pip.

`How to use it`
You run pip from your system's command-line interface, not from within Python itself. To install a package, type `pip install` followed by the name of the package.

In VS Code, run this in the integrated terminal (with your virtual environment activated, if you are using one).

Syntax: `pip install pandas`
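
One way to confirm that the install worked is to ask Python whether it can locate the package. The sketch below uses only the standard library; the helper name `is_installed` is made up for this illustration and is not part of pip or pandas.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once `pip install pandas` has succeeded in the active environment.
print(is_installed("pandas"))
```

If this prints `False` right after an install, pip most likely installed into a different Python environment than the one running the script.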


## Introduction to Pandas
- Pandas is a Python library used for working with data sets.
- It provides functions for analyzing, cleaning, exploring, and manipulating data.
- Created by Wes McKinney in 2008.
- The name Pandas is derived from **Panel Data** and **Python Data Analysis**.

### Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theories. It also helps clean messy data to make it readable and relevant.

### Installation of Pandas
If Pandas is not installed, you can install it using pip. In a Jupyter notebook cell, the leading `!` runs the line as a shell command:
```
!pip install pandas
```

In [2]:
import pandas as pd
print(pd.__version__)

2.2.3


### Pandas Series
A **Series** is a one-dimensional array holding data of any type. It is similar to a column in a table.

Use a Series when your data is one-dimensional. A single-column DataFrame can hold the same data, but it uses more memory.

In [3]:
a = [1, 7, 2]  # what type of variable is a?
myvar = pd.Series(a)  # create a Series; pandas is referred to as pd because that is the alias we imported it under
print(myvar)

0    1
1    7
2    2
dtype: int64


By default, Pandas labels the Series with index numbers starting from 0. You can also create custom labels using the `index` argument.

In [4]:
a = [1, 7, 2]
myvar = pd.Series(a, index=['x', 'y', 'z'])  # the index must match the length of the input
print(myvar)

x    1
y    7
z    2
dtype: int64
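
Once a Series has labels, you can use them to look up values. The following is a minimal sketch that reuses the labelled Series from the cell above; `.loc` and `.iloc` are the standard pandas accessors for label-based and position-based selection.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Look up a single value by its label.
y_value = myvar["y"]
print(y_value)

# .loc selects by label, .iloc selects by integer position.
print(myvar.loc["z"])
print(myvar.iloc[0])
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether a key is meant as a label or as a position.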


### Pandas DataFrame
A **DataFrame** is a 2-dimensional data structure, like a table with rows and columns.

In [5]:
data = {'calories': [420, 380, 390], 'duration': [50, 40, 45]} #what type is this?
df = pd.DataFrame(data)
print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


In [6]:
#In a Jupyter notebook, you can pretty-print a DataFrame by evaluating just its name.
df

   calories  duration
0       420        50
1       380        40
2       390        45
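
The link between the two structures can be shown directly: each column of a DataFrame is itself a Series. The sketch below rebuilds the same `calories`/`duration` frame and inspects it; the variable names `calories` and `first_row` are chosen only for this illustration.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# Selecting one column by name returns a Series.
calories = df["calories"]
print(type(calories))

# shape is (number of rows, number of columns).
print(df.shape)

# .loc[0] returns the row labelled 0, also as a Series.
first_row = df.loc[0]
print(first_row)
```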


### Read CSV Files
You can load CSV files directly into Pandas DataFrames using `pd.read_csv()`.

In [7]:
# Example to load CSV file
df = pd.read_csv('data.csv') #note how easy the syntax is in python
print(df.head())

  Month  Sales  Profit  Expenses  Customers
0   Jan    200      20       150        100
1   Feb    250      30       180        120
2   Mar    300      50       200        130
3   Apr    350      60       210        150
4   May    400      70       250        160


Side note: here is the pandas source code for this function. Look how easy they made it for you!
    
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py
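
If you do not have a `data.csv` file at hand, you can try `pd.read_csv()` on text held in memory. This is a minimal sketch: the CSV content is a shortened stand-in based on the first rows shown above, and `io.StringIO` wraps the string so that it behaves like an open file.

```python
import io

import pandas as pd

# Stand-in CSV content in place of a file on disk.
csv_text = """Month,Sales,Profit
Jan,200,20
Feb,250,30
Mar,300,50
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Column types are inferred from the text.
print(df.dtypes)
```

The first line of the text becomes the column names, and pandas infers a data type for each column from its values.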

### Read JSON Files
You can also load JSON files into Pandas DataFrames using `pd.read_json()`.

In [8]:
# Example to load JSON file
df = pd.read_json('data.json')
print(df.head())

   id       name  price  in_stock     category  rating
0   1  Product_1  98.49      True  Electronics     2.8
1   2  Product_2  43.87     False        Books     1.2
2   3  Product_3  33.23      True     Clothing     1.4
3   4  Product_4  63.88      True     Clothing     4.2
4   5  Product_5  30.03      True  Electronics     2.1


Side note: here is the pandas source code for this function. Look how easy they made it for you!
    
https://github.com/pandas-dev/pandas/blob/main/pandas/io/json/_json.py
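
The same in-memory approach works for JSON. The sketch below assumes the JSON is a list of records (one object per row), which is only one of several layouts `pd.read_json()` accepts; the two records are a shortened stand-in based on the output above, and `orient="records"` states that layout explicitly.

```python
import io

import pandas as pd

# Stand-in JSON content: a list of records, one object per row.
json_text = """[
  {"id": 1, "name": "Product_1", "price": 98.49, "in_stock": true},
  {"id": 2, "name": "Product_2", "price": 43.87, "in_stock": false}
]"""

# Wrap the string in StringIO so read_json treats it as file content.
df = pd.read_json(io.StringIO(json_text), orient="records")
print(df)
```

Each key in the records becomes a column, and each object becomes one row.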

### Analyzing DataFrames
You can get a quick overview of a DataFrame using the `head()` and `tail()` methods to view the first and last rows. Both return five rows by default; pass a number to change that.

In [9]:
print(df.head(10))
print(df.tail())

   id        name  price  in_stock     category  rating
0   1   Product_1  98.49      True  Electronics     2.8
1   2   Product_2  43.87     False        Books     1.2
2   3   Product_3  33.23      True     Clothing     1.4
3   4   Product_4  63.88      True     Clothing     4.2
4   5   Product_5  30.03      True  Electronics     2.1
5   6   Product_6  66.23     False         Home     4.0
6   7   Product_7  96.98      True        Books     4.5
7   8   Product_8  59.80      True        Books     1.0
8   9   Product_9  91.12      True     Clothing     3.6
9  10  Product_10  72.53      True        Books     2.2
   id        name  price  in_stock  category  rating
5   6   Product_6  66.23     False      Home     4.0
6   7   Product_7  96.98      True     Books     4.5
7   8   Product_8  59.80      True     Books     1.0
8   9   Product_9  91.12      True  Clothing     3.6
9  10  Product_10  72.53      True     Books     2.2
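
The row counts in the output above follow from the arguments: `head(10)` asks for ten rows, while `tail()` with no argument falls back to five. Here is a small sketch on a hypothetical ten-row frame that makes the behaviour easy to check.

```python
import pandas as pd

# A small stand-in frame with ten rows.
df = pd.DataFrame({"id": range(1, 11)})

print(len(df.head()))    # no argument: the first 5 rows
print(len(df.head(10)))  # the first 10 rows
print(df.tail(3))        # the last 3 rows
```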


The `info()` method gives you more information about the data, such as the number of entries, columns, and non-null values.

In [10]:
df.info()  # info() prints the report itself and returns None, so print() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        10 non-null     int64  
 1   name      10 non-null     object 
 2   price     10 non-null     float64
 3   in_stock  10 non-null     bool   
 4   category  10 non-null     object 
 5   rating    10 non-null     float64
dtypes: bool(1), float64(2), int64(1), object(2)
memory usage: 542.0+ bytes
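
Because `info()` writes its report out and returns `None`, the report cannot be stored by assigning the return value. The sketch below shows one way to capture it with the `buf` parameter, and how to get the same facts as ordinary pandas objects. The small frame, including its missing price, is a hypothetical stand-in.

```python
import io

import pandas as pd

# Hypothetical frame with one missing price.
df = pd.DataFrame({"price": [98.49, None, 33.23], "in_stock": [True, False, True]})

# Pass a text buffer to capture the report instead of printing it.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts as regular objects you can compute with.
print(df.shape)          # (rows, columns)
print(df.notna().sum())  # non-null count per column
print(df.dtypes)         # data type per column
```

Here the non-null count for `price` is lower than the number of rows, which is how `info()` helps you spot missing values.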
