Testing that the Python interpreter is working before we start.

In [2]:
print("Hello friend! Welcome to Pandas!")

Hello friend! Welcome to Pandas!


---
* ### How Models Work

Machine learning models are built to **find patterns in data** and use those patterns to make predictions or decisions.

When a model is given data to learn from, it goes through a process called **training** (or **fitting**).  
The data used for this purpose is known as **training data**.

#### Example: Decision Tree
A **Decision Tree** predicts outcomes by splitting data into branches based on feature values.  
Each split helps the model make a more precise decision.  
- Trees with **more levels of branches (splits)** can capture more complex patterns; these are called **“deeper trees.”**  
- The **final nodes** in a tree, where a decision or prediction is made, are called **“leaves.”**
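
As a rough illustration of splits, depth, and leaves, the sketch below hand-codes a tiny tree as nested dictionaries. The feature names (`size_sqft`, `bedrooms`), the thresholds, and the predicted prices are all made up for illustration; a real model would learn them from training data.

```python
# A hand-written decision tree: internal nodes split on a feature,
# leaves hold the final prediction. All numbers here are invented.
tree = {
    "feature": "size_sqft", "threshold": 1000,
    "left": {"leaf": 150_000},                # size_sqft <= 1000
    "right": {                                # size_sqft > 1000
        "feature": "bedrooms", "threshold": 2,
        "left": {"leaf": 250_000},            # bedrooms <= 2
        "right": {"leaf": 320_000},           # bedrooms > 2
    },
}


def predict(node, row):
    """Walk from the root to a leaf, following one branch per split."""
    while "leaf" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def depth(node):
    """Number of splits on the longest root-to-leaf path."""
    if "leaf" in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))


def count_leaves(node):
    """Number of final nodes where a prediction is made."""
    if "leaf" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


house = {"size_sqft": 1400, "bedrooms": 3}
prediction = predict(tree, house)
print(prediction, depth(tree), count_leaves(tree))
```

Adding another split under any leaf would increase the leaf count and, on that path, the depth.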



---

To start, import the pandas library as pd.

In [1]:
import pandas as pd   

* ### Basic Data Exploration

DataFrame: the main data structure of the Python pandas library. A DataFrame holds the type of data you might think of as a table, similar to a sheet in Excel or a table in a SQL database.

You can create one and fill it with dummy data, as in the second code block of the rough1.ipynb file, and then work on that data with libraries such as pandas and NumPy (among others).

There are multiple ways to read data and then store it in a **DataFrame** for usage.  

1. From Files

| Source | Function | Example |
|:--------|:-----------|:---------|
| **CSV files** | `pd.read_csv()` | `df = pd.read_csv('data.csv')` |
| **Excel files** | `pd.read_excel()` | `df = pd.read_excel('data.xlsx', sheet_name='Sheet1')` |
| **Text files** | `pd.read_table()` | `df = pd.read_table('data.txt', delimiter='\t')` |
| **JSON files** | `pd.read_json()` | `df = pd.read_json('data.json')` |
| **HTML tables** | `pd.read_html()` | `df_list = pd.read_html('https://example.com')` |
| **XML files** | `pd.read_xml()` | `df = pd.read_xml('data.xml')` |
| **Pickle files** | `pd.read_pickle()` | `df = pd.read_pickle('data.pkl')` |
| **Parquet files** | `pd.read_parquet()` | `df = pd.read_parquet('data.parquet')` |
| **Feather files** | `pd.read_feather()` | `df = pd.read_feather('data.feather')` |
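
To try a file reader without having a file on disk, one option is to pass a file-like object instead of a path. The sketch below assumes a small made-up CSV held in a string; `csv_text` and `csv_df` are names chosen here for illustration.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (made-up data).
csv_text = "Name,Age,City\nJohn,28,New York\nAnna,24,Paris\n"

# read_csv accepts a path, a URL, or any file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))

print(csv_df.shape)          # (rows, columns)
print(list(csv_df.columns))  # header row becomes the column names
```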



2. From Databases

| Source | Function | Example |
|:--------|:-----------|:---------|
| **SQL databases** | `pd.read_sql()` | `df = pd.read_sql('SELECT * FROM table', conn)` |
| **Using SQLAlchemy** | `pd.read_sql_table()` | `df = pd.read_sql_table('table_name', engine)` |
| **SQLite (local DB)** | `sqlite3` + Pandas | `conn = sqlite3.connect('data.db')` <br> `df = pd.read_sql_query("SELECT * FROM users", conn)` |
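
The SQLite row can be tried end to end with an in-memory database. This is a minimal sketch: the `users` table, its columns, and the rows are invented for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("John", 28), ("Anna", 24), ("Peter", 35)],
)
conn.commit()

# Run a query and load the result set straight into a DataFrame.
users_df = pd.read_sql_query(
    "SELECT name, age FROM users WHERE age > 25 ORDER BY age", conn
)
conn.close()

print(users_df)
```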



3. From APIs / Online Sources

| Source | Function | Example |
|:--------|:-----------|:---------|
| **Direct URL (CSV/JSON)** | `pd.read_csv('https://...')` | `df = pd.read_csv('https://example.com/data.csv')` |
| **API response (JSON)** | `pd.DataFrame()` from requests | `import requests; data = requests.get(url).json(); df = pd.DataFrame(data)` |
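
The API row depends on a live endpoint, so the sketch below swaps in a hard-coded JSON string as a stand-in for the response body. The payload and its field names are hypothetical; with `requests`, the `.json()` call would return the same kind of structure as `json.loads` does here.

```python
import json

import pandas as pd

# Hard-coded stand-in for the body of an API response (hypothetical payload).
payload = '[{"id": 1, "name": "John"}, {"id": 2, "name": "Anna"}]'

# Parse the JSON text into a list of dictionaries.
records = json.loads(payload)

# Each dictionary becomes a row; the keys become column names.
api_df = pd.DataFrame(records)
print(api_df)
```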



4. From Created Python Objects

| Source | Function | Example |
|:--------|:-----------|:---------|
| **Dictionary** | `pd.DataFrame()` | `df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})` |
| **List of Lists / Tuples** | `pd.DataFrame()` | `df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])` |
| **NumPy Array** | `pd.DataFrame()` | `import numpy as np; df = pd.DataFrame(np.random.rand(3, 3), columns=['X', 'Y', 'Z'])` |
| **Another DataFrame** | `pd.DataFrame(existing_df)` | Useful for copies or transformations |
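
For the "Another DataFrame" case, the sketch below uses `DataFrame.copy()` rather than `pd.DataFrame(existing_df)`, since `copy()` states the intent explicitly. The variable names are chosen here for illustration.

```python
import pandas as pd

original = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 24]})

# copy() makes a deep copy by default, so edits do not touch the original.
clone = original.copy()
clone.loc[0, "Age"] = 99

print(original.loc[0, "Age"])  # value in the original is unchanged
print(clone.loc[0, "Age"])     # value in the copy was updated
```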



The examples below explore DataFrame terminology using method 4 from the list above (building a DataFrame from Python objects).

In [19]:
#Simple DataFrame with dummy data (built from a list of lists)
df = pd.DataFrame([['a', 2, 'c'], [4, 'e', 6], ['i', 8, 'k']])

#columns sets the column labels and index sets the row labels
fd = pd.DataFrame([[11, 12, 13], [14, 15, 16], [17, 18, 19]], columns=['one', 'two', 'three'], index=['a', 'b', 'c'])


---
`describe()` returns summary statistics for the data, such as:
1. Numeric Data:
    - count = The number of non-null items in each column.

    - mean = The mean (sum of the values / number of values) of the items in each column.

    - std = Standard deviation, which measures how spread out the data is around the mean. There are two versions:  
        * **Population Standard Deviation ($\sigma$)**
        $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}$$
        *Used when your data set includes **every member** of the population; $\mu$ is the population mean.*
        * **Sample Standard Deviation ($s$)** (used by `describe()` by default)
        $$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2}$$
        *Used when your data set is only a **subset (a sample)** of a larger group; $\bar{x}$ is the sample mean.*
    
    - min = The smallest value.
    - max = The largest value.
    - 25% = The **first quartile**: the value below which roughly a quarter of the data falls.
    - 50% = The **median** (second quartile): the value below which roughly half of the data falls.
    - 75% = The **third quartile**: the value below which roughly three quarters of the data falls.
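
To make the two standard deviation formulas concrete, the sketch below works them out by hand for the values 11, 14, 17 (the `one` column of `fd`) and compares the results with what pandas reports. The variable names are chosen here for illustration.

```python
import math

import pandas as pd

values = pd.Series([11, 14, 17])            # the "one" column of fd
mean = values.mean()                        # (11 + 14 + 17) / 3
squared_dev = ((values - mean) ** 2).sum()  # 9 + 0 + 9

population_std = math.sqrt(squared_dev / len(values))    # divide by N
sample_std = math.sqrt(squared_dev / (len(values) - 1))  # divide by N - 1

# pandas uses the sample formula (ddof=1) unless told otherwise.
print(sample_std, values.std())
print(population_std, values.std(ddof=0))

# Quartiles use linear interpolation between sorted values by default.
print(values.quantile([0.25, 0.5, 0.75]).tolist())
```

With only three values the two formulas differ noticeably (3.0 for the sample version, about 2.45 for the population version); the gap shrinks as the number of rows grows.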

In [None]:
#Here, fd contains numeric data and thus will be used for descriptive statistics
print(fd)
print()

#Descriptive statistics (basically print the summary of the data)
fd.describe()

   one  two  three
a   11   12     13
b   14   15     16
c   17   18     19



        one   two  three
count   3.0   3.0    3.0
mean   14.0  15.0   16.0
std     3.0   3.0    3.0
min    11.0  12.0   13.0
25%    12.5  13.5   14.5
50%    14.0  15.0   16.0
75%    15.5  16.5   17.5
max    17.0  18.0   19.0


In [21]:
#To see the first 5 rows of the DataFrame
print(fd.head())


#Summary of the DataFrame
fd.info()

#int64 = 64-bit integer = 8 bytes per value

   one  two  three
a   11   12     13
b   14   15     16
c   17   18     19
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   one     3 non-null      int64
 1   two     3 non-null      int64
 2   three   3 non-null      int64
dtypes: int64(3)
memory usage: 96.0+ bytes
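
The memory figure can be checked against the 8 bytes per `int64` value. The sketch below rebuilds `fd` with the dtype fixed to `int64` (an assumption made here so the sizes are the same on every platform) and asks pandas for the per-column byte counts.

```python
import pandas as pd

# Same data as fd, with the dtype fixed to int64 so the sizes are predictable.
fd = pd.DataFrame(
    [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
    columns=["one", "two", "three"],
    index=["a", "b", "c"],
    dtype="int64",
)

# Bytes used by each column, ignoring the index: 3 rows x 8 bytes each.
per_column = fd.memory_usage(index=False)
print(per_column)
print(per_column.sum())  # total bytes for the three data columns
```

The three columns account for 72 bytes; the rest of the total reported by `info()` comes from the row index.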


In [None]:
#Print the DataFrame, then its column names (a default RangeIndex, since none were given)
print(df)
print()
print(df.columns)

   0  1  2
0  a  2  c
1  4  e  6
2  i  8  k

RangeIndex(start=0, stop=3, step=1)


In [6]:
#To see the first 2 rows of the DataFrame
print(fd.head(2))

print()

#Print the index (row names)
print(fd.index)
#Print the column names
print(fd.columns)


   one  two  three
a   11   12     13
b   14   15     16

Index(['a', 'b', 'c'], dtype='object')
Index(['one', 'two', 'three'], dtype='object')


In [7]:
#Build a DataFrame from a dictionary (keys become column names)
data={
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']   
}
sd=pd.DataFrame(data)

print(sd.head())

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London


In [8]:
#Print a single row by its index label
print(sd.loc[0])

print()

#Print several rows by passing a list of index labels
print(sd.loc[[0,2]])

print()


Name        John
Age           28
City    New York
Name: 0, dtype: object

    Name  Age      City
0   John   28  New York
2  Peter   35    Berlin
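
Because `sd` uses the default 0, 1, 2, ... index, its labels and positions happen to coincide. The sketch below uses `fd`, whose rows are labelled `a`, `b`, `c`, to show that `loc` selects by label. It also brings in `iloc`, which is not used elsewhere in these notes, to show selection by integer position.

```python
import pandas as pd

fd = pd.DataFrame(
    [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
    columns=["one", "two", "three"],
    index=["a", "b", "c"],
)

# loc selects by label: the row named "b".
by_label = fd.loc["b"]

# iloc selects by integer position: the second row (position 1).
by_position = fd.iloc[1]

print(by_label.tolist())
print(by_position.tolist())

# A row label and a column label together pick out a single cell.
cell = fd.loc["c", "two"]
print(cell)
```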

