# Week 2: Python Fundamentals and Basic Statistics

---

# Python Libraries for Data Analysis and Visualization

## Pandas
- **Purpose**: Data manipulation and analysis (e.g., CSV processing, handling DataFrames).
- **Key Features**:
  - Easily load and explore datasets (e.g., CSV files).
  - Perform operations like filtering, grouping, and aggregating data.
  - Handles missing data (detecting, dropping, and filling missing values).
  
**Example**:
```python
import pandas as pd
data = pd.read_csv("your_file.csv")
print(data.head())  # Displays the first 5 rows
```
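
The filtering, grouping, aggregating, and missing-data features listed above can be sketched on a small in-memory table. This is a minimal illustration: the `store` and `sales` columns and their values are invented and do not come from any course dataset.

```python
import pandas as pd

# Hypothetical data: store names and sales figures are invented.
df = pd.DataFrame({
    "store": ["north", "south", "north", "east", "south"],
    "sales": [100.0, None, 250.0, 80.0, 120.0],
})

# Missing data: count the gaps, then fill them with 0.
missing_count = int(df["sales"].isna().sum())
df["sales"] = df["sales"].fillna(0)

# Filtering: keep only the rows where sales exceed 90.
large = df[df["sales"] > 90]

# Grouping and aggregating: total sales per store.
totals = df.groupby("store")["sales"].sum()

print(missing_count)
print(large)
print(totals)
```

Filling with 0 is only one option; `dropna()` removes the incomplete rows instead, and which is appropriate depends on what a missing value means in the data.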

---

## NumPy
- **Purpose**: Numerical computations (e.g., arrays, statistics).
- **Key Features**:
  - Supports multi-dimensional arrays for storing numerical data.
  - Provides functions for fast mathematical computations.
  - Used in conjunction with other libraries like Pandas and scikit-learn.

**Example**:
```python
import numpy as np
array = np.array([1, 2, 3, 4, 5])
print(array.mean())  # Prints the mean: 3.0
```
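
To make the multi-dimensional arrays and fast computations mentioned above more concrete, here is a small sketch with a 2-D array. The values are arbitrary and chosen only so the results are easy to check by hand.

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns (arbitrary example values).
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)  # (rows, columns)

# axis=0 collapses the rows, giving one result per column.
col_means = grid.mean(axis=0)

# axis=1 collapses the columns, giving one result per row.
row_sums = grid.sum(axis=1)

# Vectorized arithmetic: every element is doubled without a Python loop.
doubled = grid * 2

print(col_means)
print(row_sums)
print(doubled)
```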

---

## Seaborn
- **Purpose**: Data visualization (e.g., histograms, scatter plots, correlation matrices).
- **Key Features**:
  - Simplifies the creation of complex visualizations.
  - Allows customization with colors, labels, and annotations.
  - Built on top of Matplotlib for better usability.

**Example**:
```python
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot([1, 2, 2, 3, 3, 3, 4], bins=4)
plt.show()
```
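
The correlation matrices mentioned above are usually drawn as a heatmap. The sketch below assumes synthetic data: the columns `x`, `y`, and `z` are generated at random purely for illustration, with `y` built to depend on `x`.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data (an assumption for this example): y depends on x, z does not.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=100),
    "z": rng.normal(size=100),
})

# Pairwise Pearson correlations between the columns.
corr = df.corr()

# annot=True writes each coefficient into its cell;
# vmin/vmax pin the color scale to the full correlation range.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix (synthetic data)")
plt.show()
```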

---

## Matplotlib
- **Purpose**: Basic plotting and graphical representations.
- **Key Features**:
  - Highly customizable for creating unique visualizations.
  - Supports line plots, bar charts, scatter plots, and more.
  - Often used alongside Seaborn for advanced visualizations.

**Example**:
```python
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Line Plot Example")
plt.show()
```
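
Bar charts and scatter plots, also listed above, follow the same pattern as the line plot. This sketch places both side by side with `plt.subplots`; the category names and counts are made up for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and counts, invented for illustration.
categories = ["A", "B", "C"]
counts = [5, 3, 8]

# One figure with two axes arranged in a single row.
fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: bar chart.
ax_bar.bar(categories, counts, color="steelblue")
ax_bar.set_title("Bar chart")
ax_bar.set_ylabel("Count")

# Right panel: scatter plot of the same points used in the line plot.
ax_scatter.scatter([1, 2, 3, 4], [10, 20, 25, 30])
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("X-axis")
ax_scatter.set_ylabel("Y-axis")

fig.tight_layout()
plt.show()
```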

---

## Statsmodels
- **Purpose**: Statistical modeling, including regression and ANOVA.
- **Key Features**:
  - Perform regression analysis to understand relationships between variables.
  - Tools for hypothesis testing and advanced statistical models.
  
**Example**:
```python
import statsmodels.api as sm
X = [1, 2, 3, 4]  # Predictor variable
y = [2, 4, 6, 8]  # Response variable
# add_constant adds an intercept column, so OLS fits y = b0 + b1 * X
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())
```

---

## GeoPandas
- **Purpose**: Geospatial data analysis for location-specific datasets.
- **Key Features**:
  - Read and process spatial data formats like shapefiles.
  - Perform spatial joins, intersections, and buffers.
  - Create maps for data visualization.

**Example**:
```python
import geopandas as gpd
gdf = gpd.read_file("your_shapefile.shp")
print(gdf.head())  # Displays the first few rows of spatial data
```
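
Buffers and spatial joins can be tried without a shapefile by building geometries in memory. This is a rough sketch: the `stations` and `events` tables are hypothetical, the coordinates are plain numbers with no coordinate reference system, and the `predicate` keyword assumes a reasonably recent GeoPandas release (older versions used `op`).

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical point layers; coordinates are abstract units, not real locations.
stations = gpd.GeoDataFrame(
    {"station": ["A", "B"]},
    geometry=[Point(0, 0), Point(100, 0)],
)
events = gpd.GeoDataFrame(
    {"event": [1, 2, 3]},
    geometry=[Point(5, 5), Point(95, 0), Point(50, 50)],
)

# Buffer: replace each station point with a circle of radius 20 around it.
zones = stations.copy()
zones["geometry"] = stations.buffer(20)

# Spatial join: keep each event that falls within a zone,
# and attach that zone's station name to it.
joined = gpd.sjoin(events, zones, how="inner", predicate="within")
print(joined[["event", "station"]])
```

Event 3 lies outside both circles, so the inner join drops it. With real data, buffering should be done in a projected coordinate system so that the distance has a meaningful unit.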

---

# Summary
These libraries cover the main steps of a typical data analysis workflow:
- **Pandas**: Tabular data manipulation.
- **NumPy**: Numerical calculations.
- **Seaborn & Matplotlib**: Data visualization.
- **Statsmodels**: Statistical modeling.
- **GeoPandas**: Geospatial data analysis.


In [None]:
import pandas as pd

# Read a CSV file
df = pd.read_csv("sample_data.csv")  # Use this when the file is already saved locally
print(df.head())  # Display the first 5 rows

# Get basic stats
print(df["column_name"].mean())
print(df["column_name"].unique())
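
Not every dataset is a local file. Open-data portals built on Socrata, such as the NYC portal queried in the next cell, accept a `$limit` query parameter, and the support article linked there covers fetching more than 1000 rows. One common approach is to page through the data by combining `$limit` with `$offset`. The sketch below only builds the page URLs; the helper name `page_urls` and the page size are assumptions made for this example, and the download step is left as a comment because it needs network access.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"


def page_urls(base_url, page_size, n_pages):
    """Build one URL per page using the $limit and $offset parameters.

    Hypothetical helper written for this example, not part of any library.
    """
    urls = []
    for page in range(n_pages):
        params = {"$limit": page_size, "$offset": page * page_size}
        # safe="$" keeps the dollar sign readable instead of encoding it as %24.
        urls.append(f"{base_url}?{urlencode(params, safe='$')}")
    return urls


urls = page_urls(BASE_URL, page_size=1000, n_pages=3)
for u in urls:
    print(u)

# To download and combine the pages (requires network access):
# import pandas as pd
# df = pd.concat((pd.read_csv(u) for u in urls), ignore_index=True)
```

Socrata's paging documentation also suggests adding an `$order` clause so that rows come back in a stable order across pages; it is omitted here to keep the sketch short.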

In [None]:
# https://support.socrata.com/hc/en-us/articles/202949268-How-to-query-more-than-1000-rows-of-a-dataset
import pandas as pd

# URL of the dataset (NYC Motor Vehicle Collisions - Crashes)
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"

# Set a limit for the number of records to fetch (use $limit parameter)
record_limit = 50  # Adjust the limit as needed 

# Append the $limit parameter to the URL
query_url = f"{url}?$limit={record_limit}"

# Read the dataset using Pandas
print(f"Fetching data from: {query_url}")
try:
    df = pd.read_csv(query_url)
    print("Dataset successfully loaded!")

    # Display the first 5 rows of the dataset
    print("First 5 rows of the dataset:")
    print(df.head())

    # Display basic information about the dataset
    print("\nDataset Info:")
    df.info()  # info() prints its report directly and returns None

except Exception as e:
    print("An error occurred while fetching the dataset:", e)

Fetching data from: https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=50
Dataset successfully loaded!
First 5 rows of the dataset:
                crash_date crash_time   borough  zip_code  latitude  \
0  2021-09-11T00:00:00.000       2:39       NaN       NaN       NaN   
1  2022-03-26T00:00:00.000      11:45       NaN       NaN       NaN   
2  2023-11-01T00:00:00.000       1:29  BROOKLYN   11230.0  40.62179   
3  2022-06-29T00:00:00.000       6:55       NaN       NaN       NaN   
4  2022-09-21T00:00:00.000      13:21       NaN       NaN       NaN   

   longitude                       location           on_street_name  \
0        NaN                            NaN    WHITESTONE EXPRESSWAY   
1        NaN                            NaN  QUEENSBORO BRIDGE UPPER   
2 -73.970024  \n,  \n(40.62179, -73.970024)            OCEAN PARKWAY   
3        NaN                            NaN       THROGS NECK BRIDGE   
4        NaN                            NaN          BROOKLYN BRIDGE   
