# HW1
## Esteban Velasquez
## 114502904

## Q1: What is a data frame in Python data analysis?
In Python data analysis, a DataFrame is a two-dimensional, tabular data structure, similar to a spreadsheet or SQL table, provided by the pandas library.

It is one of the most widely used tools for handling structured data.

### 🔹 Key Characteristics

* **Rows and columns:** data is organized in labeled rows (the index) and columns (column names).
* **Heterogeneous data:** each column can store a different data type (e.g., integers, floats, strings, dates).
* **Labeled axes:** both rows and columns have labels, which makes data selection and manipulation intuitive.
* **Size mutable:** you can insert or delete columns and rows.

### 🔹 Example

```python
import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Milan', 'Paris', 'Berlin']
}

df = pd.DataFrame(data)

print(df)
```

**Output:**

```
      Name  Age    City
0    Alice   25   Milan
1      Bob   30   Paris
2  Charlie   35  Berlin
```
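
The characteristics listed earlier can be checked on the same small frame. This is only a minimal sketch: the `Country` column and its values are hypothetical additions that are not part of the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["Milan", "Paris", "Berlin"],
})

# Heterogeneous data: each column carries its own dtype.
print(df.dtypes)

# Labeled axes: pick one value by row label (1) and column name ("City").
bob_city = df.loc[1, "City"]

# Size mutable: add a column, then remove a row.
# "Country" and its values are hypothetical additions for this sketch.
df["Country"] = ["Italy", "France", "Germany"]
df_without_alice = df.drop(index=0)  # returns a new frame; df itself is unchanged
```

Note that `drop` returns a new DataFrame by default, so `df` still has three rows afterwards.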

### 🔹 Common Uses

* Reading and writing data (CSV, Excel, SQL, JSON)
* Data cleaning (handling missing values, duplicates)
* Data transformation and filtering
* Descriptive statistics and aggregation
* Visualization and model input preparation
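
A few of these uses can be sketched on a tiny in-memory frame. The records below are made up for illustration, and filling the missing age with the median is just one possible cleaning choice, not the only reasonable one.

```python
import pandas as pd

# Made-up records with one duplicate row and one missing age.
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 30, None, 41],
    "City": ["Milan", "Paris", "Paris", "Berlin", "Milan"],
})

# Cleaning: drop exact duplicate rows, then fill the missing age with the median.
clean = raw.drop_duplicates().reset_index(drop=True)
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Filtering: keep people aged 30 or older.
aged_30_plus = clean[clean["Age"] >= 30]

# Aggregation: average age per city.
avg_age_by_city = clean.groupby("City")["Age"].mean()
print(avg_age_by_city)
```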

### 🔹 In short:

A DataFrame is the core data structure in pandas for analyzing and manipulating structured data efficiently.


## Q2: How to load a dataset as a data frame?
To **load a dataset as a DataFrame** in Python, you typically use the **pandas** library.
Pandas provides many functions for reading data from different file formats (CSV, Excel, SQL, JSON, etc.) and converting them into a **DataFrame**.

Here’s a breakdown 👇

---

### 🐍 Step 1: Import pandas

```python
import pandas as pd
```

---

### 📂 Step 2: Load the dataset

#### 1. **From a CSV file**

```python
df = pd.read_csv("data.csv")
```

* `"data.csv"` → path to your file (can be local or a URL)
* Optional parameters:

  ```python
  df = pd.read_csv("data.csv", delimiter=",", header=0, encoding="utf-8")
  ```
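
To see what such optional parameters do without needing a file on disk, the sketch below reads CSV text from an in-memory buffer. The data is made up, and it uses `sep` (the same setting as `delimiter`) together with `decimal`, which is not mentioned above, to parse a semicolon-separated file with decimal commas.

```python
from io import StringIO

import pandas as pd

# Inline text standing in for a file on disk; the contents are made up.
csv_text = "id;name;score\n1;Alice;9,5\n2;Bob;7,0\n"

# sep=";" sets the field separator and decimal="," parses "9,5" as 9.5.
df = pd.read_csv(StringIO(csv_text), sep=";", decimal=",")
print(df.dtypes)
```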

---

#### 2. **From an Excel file**

```python
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
```

---

#### 3. **From a JSON file**

```python
df = pd.read_json("data.json")
```
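
As a rough illustration of which JSON layout maps naturally onto rows, the sketch below parses a list of records from an in-memory buffer. The records are made up; each object becomes one row and each key becomes a column.

```python
from io import StringIO

import pandas as pd

# A JSON array of objects: one object per row (made-up records).
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'

# Wrapping the text in StringIO lets read_json treat it like a file.
df = pd.read_json(StringIO(json_text))
print(df)
```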

---

#### 4. **From a SQL database**

```python
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM table_name", conn)
```
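
The following self-contained sketch builds a throwaway in-memory SQLite database so the query can actually be run. The `players` table and its rows are hypothetical; the point is only to show a parameterized query being loaded into a DataFrame.

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Alice", "Red", 1.5), ("Bob", "Blue", 2.0), ("Carol", "Red", 3.5)],
)
conn.commit()

# params fills the ? placeholder instead of formatting the SQL string by hand.
df = pd.read_sql_query(
    "SELECT name, salary FROM players WHERE team = ?", conn, params=("Red",)
)
conn.close()
print(df)
```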

---

#### 5. **From a dictionary (manual creation)**

```python
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
```

---

### 🔍 Step 3: Inspect your DataFrame

```python
df.head()      # First 5 rows
df.info()      # Column names, data types, memory usage
df.describe()  # Basic statistics
```
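
One detail worth knowing when inspecting a frame: `head()` and `describe()` return DataFrames, while `info()` prints its report directly and returns `None`. A minimal sketch with made-up data (`isna().sum()` is an extra check not listed above):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", None], "Age": [25, 30, 35]})

first_rows = df.head(2)               # DataFrame with the first 2 rows
info_result = df.info()               # prints a report; the return value is None
stats = df.describe()                 # DataFrame of statistics for numeric columns
missing_per_column = df.isna().sum()  # number of missing values in each column
```

Because `info()` returns `None`, wrapping it in `print()` adds a stray `None` line to the output.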

---

### ✅ Example: Loading a CSV

```python
import pandas as pd

df = pd.read_csv("students.csv")

print(df.head())
```

**Output:**

```
   ID   Name  Age Grade
0   1  Alice   21     A
1   2    Bob   22     B
2   3  Carol   20     A
```

---



## Q3: What are summary statistics?
**Summary statistics** are **numerical measures** that describe and summarize the main features of a dataset.

They give you a **quick overview** of the data — helping you understand its **central tendency, spread, and distribution** without looking at every individual value.

---

### 🔹 Types of Summary Statistics

#### 1. **Measures of Central Tendency**

These describe the **center** of the data:

* **Mean** → average value
* **Median** → middle value when data is sorted
* **Mode** → most frequent value

#### 2. **Measures of Dispersion (Spread)**

These show how **spread out** the data is:

* **Range** → difference between max and min
* **Variance** → average squared deviation from the mean
* **Standard deviation** → square root of variance (how far values typically are from the mean)
* **Interquartile range (IQR)** → difference between the 75th and 25th percentiles

#### 3. **Shape of Distribution**

These describe how the data is distributed:

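* **Skewness** → asymmetry of the distribution (a longer tail on the left or on the right)
* **Kurtosis** → how heavy the tails are compared with a normal distribution (often loosely described as “peakedness”)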

#### 4. **Count & Percentiles**

* **Count** → number of observations
* **Min / Max / Quartiles** → position-based summaries
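
The measures listed above can be computed one by one on a pandas Series. The values below are made up and chosen so the results are easy to verify by hand. Note that pandas uses the sample variance (dividing by n - 1) by default, which differs slightly from the "average squared deviation" definition.

```python
import pandas as pd

# Made-up values, chosen so the results are easy to check by hand.
values = pd.Series([2, 4, 4, 5, 10])

# Central tendency
mean = values.mean()             # (2 + 4 + 4 + 5 + 10) / 5 = 5.0
median = values.median()         # middle of the sorted values = 4.0
mode = values.mode().tolist()    # most frequent value(s) = [4]

# Dispersion
value_range = values.max() - values.min()             # 10 - 2 = 8
sample_var = values.var()                             # 36 / 4 = 9.0 (ddof=1 is the default)
population_var = values.var(ddof=0)                   # 36 / 5 = 7.2
sample_std = values.std()                             # sqrt(9.0) = 3.0
iqr = values.quantile(0.75) - values.quantile(0.25)   # 5.0 - 4.0 = 1.0

# Shape
skewness = values.skew()   # positive here: the value 10 stretches the right tail
kurtosis = values.kurt()   # excess kurtosis (a normal distribution scores about 0)
```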

---

### 🔹 In Python (with pandas)

You can quickly compute summary statistics with:

```python
import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe())
```

**Example Output:**

```
             Age     Salary
count    100.000  100.00000
mean      35.480  55000.230
std        8.260  15000.120
min       22.000  30000.000
25%       29.000  42000.000
50%       35.000  52000.000
75%       41.000  65000.000
max       60.000  98000.000
```
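
By default `describe()` only covers numeric columns. Passing `include="all"` also summarizes text columns with `unique`, `top`, and `freq`. A small sketch with made-up data:

```python
import pandas as pd

# Made-up frame with one text column and one numeric column.
df = pd.DataFrame({
    "Team": ["Red", "Red", "Blue", "Red"],
    "Salary": [1.0, 2.0, 3.0, 6.0],
})

numeric_only = df.describe()              # covers Salary only
all_columns = df.describe(include="all")  # adds unique/top/freq rows for Team

print(all_columns)
```

In the combined table, statistics that do not apply to a column (for example `mean` for `Team`) show up as `NaN`.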

---

### 🔹 In short:

> **Summary statistics** provide a **compact numerical description** of your dataset, helping you quickly grasp its main characteristics and detect patterns or anomalies.

---



# Code Section
## Load Dataframe

```python
import pandas as pd

# Load the NBA salaries CSV directly from its raw GitHub URL
url = "https://raw.githubusercontent.com/yu-to-chen/data-science/master/assets/data/nba_salaries.csv"
df = pd.read_csv(url)
```

## Summary Stats

```python
# Rename the columns to short, consistent names
# (this assumes the CSV has exactly four columns in this order)
df.columns = ['PLAYER', 'POSITION', 'TEAM', 'SALARY']

# Basic info: df.info() prints its report itself and returns None,
# so it is called directly instead of being wrapped in print()
print("Dataset Overview")
df.info()
print()

print("🔹 First 5 rows:")
print(df.head(), "\n")

# Summary Statistics
print("Summary Statistics (Numeric):")
print(df.describe(), "\n")

print("Summary Statistics (All Columns):")
print(df.describe(include='all'), "\n")

# Group Analysis
print("Average Salary by Team (Top 10):")
print(df.groupby("TEAM")["SALARY"].mean().sort_values(ascending=False).head(10), "\n")

print("Average Salary by Position:")
print(df.groupby("POSITION")["SALARY"].mean().sort_values(ascending=False), "\n")

# Salary Extremes
print("Top 10 Highest Paid Players:")
print(df.nlargest(10, "SALARY")[["PLAYER", "TEAM", "SALARY"]], "\n")

print("10 Lowest Paid Players:")
print(df.nsmallest(10, "SALARY")[["PLAYER", "TEAM", "SALARY"]])
```

```
Dataset Overview
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 417 entries, 0 to 416
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   PLAYER    417 non-null    object
 1   POSITION  417 non-null    object
 2   TEAM      417 non-null    object
 3   SALARY    417 non-null    float64
dtypes: float64(1), object(3)
memory usage: 13.2+ KB

🔹 First 5 rows:
           PLAYER POSITION           TEAM     SALARY
0    Paul Millsap       PF  Atlanta Hawks  18.671659
1      Al Horford        C  Atlanta Hawks  12.000000
2  Tiago Splitter        C  Atlanta Hawks   9.756250
3     Jeff Teague       PG  Atlanta Hawks   8.000000
4     Kyle Korver       SG  Atlanta Hawks   5.746479

Summary Statistics (Numeric):
           SALARY
count  417.000000
mean     5.074814
std      5.221437
min      0.030888
25%      1.270964
50%      3.000000
75%      7.000000
max     25.000000

Summary Statistics (All Columns):
              PLAYER
```

## What’s the most interesting finding about the data?
What caught my attention most is how large the salary gap is between teams and between players: in the summary statistics, the median salary is 3.0 while the maximum is 25.0. The Cleveland Cavaliers had by far the highest average salary, which makes sense given their star lineup at the time. I also found it interesting that centers earned the most on average, which suggests how highly that position was valued. It’s funny to see Kevin Durant still in OKC before his move to the Warriors, and how little Antetokounmpo was making back then compared to what he’s worth now.
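
One way to put a number on a gap like this is to compare the top salary with the median and to look at several statistics per group rather than the mean alone. The sketch below does that on just the five rows shown under "First 5 rows", typed in by hand, so its results describe that small sample and not the full dataset.

```python
import pandas as pd

# The five rows shown under "First 5 rows", typed in by hand.
sample = pd.DataFrame({
    "PLAYER": ["Paul Millsap", "Al Horford", "Tiago Splitter", "Jeff Teague", "Kyle Korver"],
    "POSITION": ["PF", "C", "C", "PG", "SG"],
    "TEAM": ["Atlanta Hawks"] * 5,
    "SALARY": [18.671659, 12.000000, 9.756250, 8.000000, 5.746479],
})

# Several statistics per position at once; a mean alone hides how many players it covers.
by_position = sample.groupby("POSITION")["SALARY"].agg(["count", "mean", "median", "max"])
print(by_position)

# Ratio of the top salary to the typical (median) salary in the sample.
gap_ratio = sample["SALARY"].max() / sample["SALARY"].median()
print(round(gap_ratio, 2))
```

Applying the same ratio to the full dataset's `describe()` output gives 25.0 / 3.0, or roughly 8.3.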
