# Pandas Basics Portfolio
This notebook demonstrates foundational pandas concepts through challenges and practical code.

## Introduction to Pandas & NumPy
**Challenge:** Perform high-speed numerical operations without writing manual loops.
**Theory:** NumPy provides vectorized arrays, and pandas builds on NumPy to add labeled data structures.

In [None]:
import numpy as np
import pandas as pd

### Challenge: Subtract a constant from every price in an array.
**Theory:** NumPy arrays support element-wise operations, avoiding Python loops.

In [None]:
toy_prices = np.array([5, 8, 3, 6])
toy_prices - 2  # element-wise: subtracts 2 from every price
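
To make the contrast with a manual loop concrete, here is a minimal sketch that computes the same result both ways. The name `looped` is introduced only for this illustration.

```python
import numpy as np

toy_prices = np.array([5, 8, 3, 6])

# Loop version: one Python-level subtraction per element.
looped = [price - 2 for price in toy_prices.tolist()]

# Vectorized version: NumPy applies the subtraction to the whole array at once.
vectorized = toy_prices - 2

print(looped)      # [3, 6, 1, 4]
print(vectorized)  # [3 6 1 4]
```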

### Challenge: Label numeric data with custom names.
**Theory:** A pandas `Series` pairs values with an index so we can reference data by label.

In [None]:
ages = np.array([13,25,19])
ser = pd.Series(ages, index=['Emma','Swetha','Serajh'])
ser
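
Once values carry labels, they can be looked up by name rather than by position. The sketch below reuses the same three ages; the variable names `swetha_age`, `two_people`, and `adults` are illustrative only.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.array([13, 25, 19]), index=['Emma', 'Swetha', 'Serajh'])

swetha_age = ser['Swetha']             # single label -> scalar value
two_people = ser[['Emma', 'Serajh']]   # list of labels -> smaller Series
adults = ser[ser >= 18]                # boolean mask keeps the matching labels

print(swetha_age)          # 25
print(list(adults.index))  # ['Swetha', 'Serajh']
```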

### Challenge: Organize records in a table and choose a meaningful index.
**Theory:** A `DataFrame` is a 2D table; `.set_index` assigns a column as row labels.

In [None]:
df = pd.DataFrame([
    ['John Smith','123 Main St',34],
    ['Jane Doe','456 Maple Ave',28],
    ['Joe Schmo','789 Broadway',51]
], columns=['name','address','age'])

df.set_index('name')
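
One detail worth checking: `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned. A small sketch, using a two-row version of the table and a hypothetical name `by_name`:

```python
import pandas as pd

df = pd.DataFrame([
    ['John Smith', '123 Main St', 34],
    ['Jane Doe', '456 Maple Ave', 28],
], columns=['name', 'address', 'age'])

by_name = df.set_index('name')   # new DataFrame; df itself is unchanged

print('name' in df.columns)            # True: the original still has the column
print(by_name.loc['Jane Doe', 'age'])  # 28: rows can now be looked up by name
```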

## Lesson 1 — Lambda Functions
Anonymous one-line functions help express small transformations inline.

### Challenge: Increment numbers without defining a full function.
**Theory:** `lambda` creates a small unnamed function.

In [None]:
add_two = lambda x: x + 2
add_two(3)
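
For comparison, the sketch below writes the same function with `def` and then shows a typical place where a lambda is convenient, as a short `key` argument. The names `add_two_def`, `words`, and `by_length` are made up for this example.

```python
# The same function written both ways.
add_two = lambda x: x + 2

def add_two_def(x):
    return x + 2

print(add_two(3), add_two_def(3))  # 5 5

# Lambdas are handy as short throwaway arguments, for example a sort key.
words = ['pandas', 'np', 'series']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['np', 'pandas', 'series']
```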

### Challenge: Check if a letter exists inside a word.
**Theory:** The `in` operator tests substring membership.

In [None]:
contains_a = lambda word: 'a' in word
contains_a('banana')

### Challenge: Flag strings longer than twelve characters.
**Theory:** `len()` returns length; combine with comparison.

In [None]:
long_string = lambda s: len(s) > 12
long_string('photosynthesis')

### Challenge: Check whether a word ends in the letter 'a'.
**Theory:** Negative indexing retrieves characters from the end.

In [None]:
ends_in_a = lambda s: s[-1] == 'a'
ends_in_a('data')

### Challenge: Return double a number when it exceeds 10, else zero.
**Theory:** Lambdas support inline conditionals `A if cond else B`.

In [None]:
double_or_zero = lambda n: n*2 if n > 10 else 0
double_or_zero(15)

### Challenge: Classify numbers as even or odd.
**Theory:** Modulo 2 reveals parity.

In [None]:
even_or_odd = lambda n: 'even' if n % 2 == 0 else 'odd'
even_or_odd(7)

### Challenge: Identify multiples of three.
**Theory:** Use `%` to test divisibility.

In [None]:
multiple_of_three = lambda n: 'multiple of three' if n % 3 == 0 else 'not a multiple'
multiple_of_three(9)

### Challenge: Grab the ones place of a number.
**Theory:** `% 10` isolates the final digit.

In [None]:
ones_place = lambda n: n % 10
ones_place(123)

### Challenge: Compute twice the square of a number.
**Theory:** Exponentiation uses `**`.

In [None]:
double_square = lambda n: 2 * (n**2)
double_square(5)

### Challenge: Add a random amount to a base number.
**Theory:** Lambdas can include library calls like `random.randint`.

In [None]:
import random
add_random = lambda n: n + random.randint(1,10)
add_random(5)
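
Because the result changes on every call, it can help to fix the seed while testing. This is a small sketch, assuming the standard-library `random.seed` is acceptable for the use case; the seed value 42 is arbitrary.

```python
import random

add_random = lambda n: n + random.randint(1, 10)  # randint includes both endpoints

random.seed(42)
first = add_random(5)
random.seed(42)
second = add_random(5)

print(first == second)   # True: same seed, same draw
print(6 <= first <= 15)  # True: 5 plus a value from 1 to 10
```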

## Lesson 2 — Creating, Loading, and Selecting Data

### Challenge: Build a table from a Python dictionary.
**Theory:** Dictionary keys become column names, and each value supplies that column's data.

In [None]:
df = pd.DataFrame({
    'name': ['John Smith','Jane Doe','Joe Schmo'],
    'address': ['123 Main St.','456 Maple Ave.','789 Broadway'],
    'age': [34,28,51]
})
df
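
A dictionary of lists is column-oriented. As a hedged aside, the same table can also be built row by row from a list of dictionaries; the sketch below uses a shortened two-row table and the illustrative names `by_column` and `by_row`.

```python
import pandas as pd

# Column-oriented: each key is a column name, each list holds that column's values.
by_column = pd.DataFrame({'name': ['John Smith', 'Jane Doe'], 'age': [34, 28]})

# Row-oriented: each dictionary is one record.
by_row = pd.DataFrame([
    {'name': 'John Smith', 'age': 34},
    {'name': 'Jane Doe', 'age': 28},
])

print(list(by_column.columns))   # ['name', 'age']
print(by_column.equals(by_row))  # True
```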

### Challenge: Create a table from a list of lists.
**Theory:** Provide `columns` to name each field.

In [None]:
cities = pd.DataFrame([
    ['San Diego',100],
    ['Los Angeles',120],
    ['San Francisco',90]
], columns=['Location','Employees'])
cities

### Challenge: Persist data to disk and reload it.
**Theory:** `to_csv` writes a DataFrame; `read_csv` brings it back.

In [None]:
cities.to_csv('sample.csv', index=False)
reloaded = pd.read_csv('sample.csv')
reloaded.head()
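
The `index=False` argument matters on the round trip. The sketch below writes to an in-memory string instead of a file (an assumption made so the example leaves nothing on disk) and compares the two settings.

```python
import io
import pandas as pd

cities = pd.DataFrame({'Location': ['San Diego', 'Los Angeles'], 'Employees': [100, 120]})

# index=False: only the data columns are written.
clean = pd.read_csv(io.StringIO(cities.to_csv(index=False)))

# Default index=True: the row labels are written as an extra unnamed column.
with_index = pd.read_csv(io.StringIO(cities.to_csv()))

print(list(clean.columns))       # ['Location', 'Employees']
print(list(with_index.columns))  # ['Unnamed: 0', 'Location', 'Employees']
```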

### Challenge: Inspect a DataFrame's structure.
**Theory:** `head`, `info`, and `describe` reveal sample rows, column types, and statistics.

In [None]:
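print(reloaded.head())  # print explicitly: a cell only echoes its last expression
reloaded.info()         # column dtypes and non-null counts (prints on its own)
reloaded.describe()     # summary statistics for the numeric columns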

### Challenge: Pull specific columns.
**Theory:** Selecting a single column returns a Series; multiple columns give a DataFrame.

In [None]:
ages = df['age']
subset = df[['name','age']]
ages.head(), subset.head()
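
The difference between single and double brackets shows up in the type and shape of the result. A quick check on a hypothetical two-row table:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'Jane Doe'], 'age': [34, 28]})

single = df['age']      # single brackets -> Series
as_frame = df[['age']]  # double brackets -> one-column DataFrame

print(type(single).__name__, single.shape)      # Series (2,)
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (2, 1)
```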

### Challenge: Grab rows by position.
**Theory:** `iloc` selects by zero-based integer position and supports slicing.

In [None]:
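third_row = df.iloc[2]  # positions are zero-based, so 2 is the third row
last3 = df.iloc[-3:]    # negative slice takes the last three rows
third_row, last3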
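
Position and label coincide only while the index is the default 0, 1, 2. The sketch below uses a made-up index of 10, 20, 30 to show where `iloc` and `loc` diverge.

```python
import pandas as pd

df = pd.DataFrame({'age': [34, 28, 51]}, index=[10, 20, 30])

first_by_position = df.iloc[0]  # first row, whatever its label is
row_by_label = df.loc[10]       # the row whose label is 10

print(first_by_position['age'], row_by_label['age'])  # 34 34

# Slices differ too: iloc excludes the stop position, loc includes the stop label.
print(len(df.iloc[0:2]), len(df.loc[10:30]))  # 2 3
```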

### Challenge: Filter rows with logical conditions.
**Theory:** A Boolean mask keeps the rows where it is `True`; combine masks with `&` (and) and `|` (or), wrapping each condition in parentheses.

In [None]:
over_30 = df[df['age'] > 30]
on_maple = df[df['address'].str.contains('Maple')]
over_30, on_maple
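
The `&` and `|` operators can be sketched on the same three-person table. The result names `over_30_on_main` and `young_or_broadway` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51],
})

# Each condition gets its own parentheses because & and | bind tighter than comparisons.
over_30_on_main = df[(df['age'] > 30) & (df['address'].str.contains('Main'))]
young_or_broadway = df[(df['age'] < 30) | (df['address'].str.contains('Broadway'))]

print(over_30_on_main['name'].tolist())    # ['John Smith']
print(young_or_broadway['name'].tolist())  # ['Jane Doe', 'Joe Schmo']
```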

### Challenge: Reset the index after subsetting.
**Theory:** `reset_index(drop=True)` discards the old row labels and renumbers the rows from 0.

In [None]:
subset = df.loc[[0,2]]
subset.reset_index(drop=True)
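
To see what `drop=True` changes, compare it with the default behavior, which keeps the old labels as a new column. A minimal sketch with hypothetical names `kept` and `dropped`:

```python
import pandas as pd

df = pd.DataFrame({'age': [34, 28, 51]})
subset = df.loc[[0, 2]]                   # index is now 0, 2

kept = subset.reset_index()               # old labels move into a column named 'index'
dropped = subset.reset_index(drop=True)   # old labels are discarded

print(list(kept.columns))   # ['index', 'age']
print(list(dropped.index))  # [0, 1]
```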

## Lesson 3 — Modifying DataFrames

### Challenge: Add new columns with constants or lists.
**Theory:** Assigning a scalar broadcasts; lists must match row count.

In [None]:
df['In Stock?'] = True
df['Sold in Bulk?'] = ['Yes','Yes','No']
df.head()
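
The row-count rule can be checked directly: a scalar is broadcast to every row, while a list of the wrong length is rejected. In this sketch the flag `length_error` is a name introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'Jane Doe', 'Joe Schmo']})

df['In Stock?'] = True  # scalar: repeated for all three rows

try:
    df['Sold in Bulk?'] = ['Yes', 'No']  # two values for three rows
    length_error = False
except ValueError as err:
    length_error = True
    print('ValueError:', err)

print(df['In Stock?'].tolist())  # [True, True, True]
```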

### Challenge: Compute values from existing columns.
**Theory:** Vectorized arithmetic creates efficient derived columns.

In [None]:
df['Taxed Age'] = df['age'] * 0.1
df.head()

### Challenge: Transform a column with a function.
**Theory:** `Series.apply` applies Python functions element-wise.

In [None]:
df['Lower'] = df['name'].apply(str.lower)
df.head()
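
For common string operations, the `.str` accessor offers a built-in alternative to `apply`. The sketch below shows that both routes give the same lowercase names on a small illustrative Series.

```python
import pandas as pd

names = pd.Series(['John Smith', 'Jane Doe'])

via_apply = names.apply(str.lower)  # calls the Python function on each element
via_str = names.str.lower()         # pandas string method, same result

print(via_apply.tolist())  # ['john smith', 'jane doe']
print(via_str.tolist())    # ['john smith', 'jane doe']
```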

### Challenge: Compute a value needing multiple columns.
**Theory:** `DataFrame.apply` with `axis=1` operates row-by-row.

In [None]:
bonus = lambda row: row['age'] * 2 if row['age']>40 else row['age']
df['bonus'] = df.apply(bonus, axis=1)
df.head()
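
Row-wise `apply` is flexible but runs a Python function per row. When the rule is a simple condition, `np.where` is one possible vectorized alternative; this sketch checks that the two agree on the three ages used earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [34, 28, 51]})

# Row-by-row: the lambda receives each row as a Series.
row_wise = df.apply(lambda row: row['age'] * 2 if row['age'] > 40 else row['age'], axis=1)

# Vectorized: np.where(condition, value_if_true, value_if_false) on whole columns.
vectorized = np.where(df['age'] > 40, df['age'] * 2, df['age'])

print(row_wise.tolist())    # [34, 28, 102]
print(vectorized.tolist())  # [34, 28, 102]
```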

### Challenge: Rename columns wholesale.
**Theory:** Assign a new list to `df.columns`; it must contain exactly one name per existing column, in order.

In [None]:
df.columns = ['Name','Address','Age','In Stock?','Sold in Bulk?','Taxed Age','Lower','Bonus']
df.head()

### Challenge: Rename a single column safely.
**Theory:** `.rename` targets specific columns.

In [None]:
df.rename(columns={'Address':'Street'}, inplace=True)
df.head()
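
Two behaviors of `.rename` are easy to miss: without `inplace=True` it returns a new DataFrame, and a misspelled column name is ignored silently unless `errors='raise'` is passed. A sketch on a one-row table, where the misspelling `'Adress'` is deliberate:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Smith'], 'Address': ['123 Main St.']})

renamed = df.rename(columns={'Address': 'Street'})  # df itself is unchanged
print(list(df.columns))       # ['Name', 'Address']
print(list(renamed.columns))  # ['Name', 'Street']

# A key that matches no column is skipped by default.
unchanged = df.rename(columns={'Adress': 'Street'})

# errors='raise' turns the silent skip into a KeyError.
try:
    df.rename(columns={'Adress': 'Street'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)  # True
```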

## Lesson 4 — Aggregates in Pandas

### Challenge: Summarize a single column.
**Theory:** Descriptive methods compute statistics like mean or unique counts.

In [None]:
df['Age'].mean(), df['Name'].nunique()

### Challenge: Count orders by product and status.
**Theory:** `groupby` aggregates rows sharing keys.

In [None]:
orders = pd.read_csv('pandas_basics_project/data/orders.csv')
shoe_counts = orders.groupby(['product_id','status']).order_id.count().reset_index()
shoe_counts.head()
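
Since `orders.csv` is not shown here, the following sketch runs the same `groupby` on a small made-up table. The five rows and their values are assumptions chosen only to make the output easy to verify.

```python
import pandas as pd

# Hypothetical stand-in for orders.csv.
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'product_id': ['A', 'A', 'A', 'B', 'B'],
    'status': ['Shipped', 'Shipped', 'Pending', 'Shipped', 'Pending'],
})

# One output row per (product_id, status) pair; order_id now holds the count.
counts = orders.groupby(['product_id', 'status']).order_id.count().reset_index()

print(counts)
print(counts['order_id'].tolist())  # [1, 2, 1, 1]
```

The groups come back sorted by key, so the counts read as A/Pending, A/Shipped, B/Pending, B/Shipped.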

### Challenge: Compare categories across two dimensions.
**Theory:** `pivot` reshapes grouped data from long to wide format, with one column per status, for easier comparison.

In [None]:
pivoted = shoe_counts.pivot(index='product_id', columns='status', values='order_id').reset_index()
pivoted.head()
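
When a product has no orders in some status, the pivoted table holds `NaN` in that cell. The sketch below uses a made-up counts table to show the gap and one way to fill it.

```python
import pandas as pd

# Hypothetical grouped counts: product B has no 'Pending' orders.
counts = pd.DataFrame({
    'product_id': ['A', 'A', 'B'],
    'status': ['Pending', 'Shipped', 'Shipped'],
    'order_id': [1, 2, 1],
})

pivoted = counts.pivot(index='product_id', columns='status', values='order_id')
filled = pivoted.fillna(0).astype(int)  # treat a missing combination as zero orders

print(pivoted)
print(filled.loc['B', 'Pending'])  # 0
```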

## Lesson 5 — Working with Multiple DataFrames

### Challenge: Combine customer and order information.
**Theory:** `merge` performs SQL-style joins.

In [None]:
customers = pd.read_csv('pandas_basics_project/data/customers.csv')
combined = orders.merge(customers, on='customer_id', how='inner')
combined.head()
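
The `how` argument decides what happens to rows without a match. This sketch uses two tiny hypothetical tables in which one order refers to an unknown customer.

```python
import pandas as pd

# Made-up data: customer 99 does not appear in the customers table.
orders = pd.DataFrame({'order_id': [1, 2, 3], 'customer_id': [10, 11, 99]})
customers = pd.DataFrame({'customer_id': [10, 11], 'customer_name': ['Ana', 'Ben']})

inner = orders.merge(customers, on='customer_id', how='inner')  # matched rows only
left = orders.merge(customers, on='customer_id', how='left')    # every order is kept

print(len(inner), len(left))               # 2 3
print(left['customer_name'].isna().sum())  # 1
```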

### Challenge: Merge on columns with different names.
**Theory:** Use `left_on` and `right_on` to specify keys.

In [None]:
products = pd.read_csv('pandas_basics_project/data/products.csv')
renamed = products.rename(columns={'product_id':'id'})
orders.merge(renamed, left_on='product_id', right_on='id').head()
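
With `left_on` and `right_on`, both key columns survive in the result. A small sketch on hypothetical tables, dropping the redundant key afterwards:

```python
import pandas as pd

# Made-up stand-ins for the orders and products tables.
orders = pd.DataFrame({'order_id': [1, 2], 'product_id': [7, 8]})
products = pd.DataFrame({'id': [7, 8], 'description': ['boots', 'sandals']})

merged = orders.merge(products, left_on='product_id', right_on='id')
print(list(merged.columns))  # ['order_id', 'product_id', 'id', 'description']

tidy = merged.drop(columns='id')  # 'id' duplicates 'product_id' after the join
print(list(tidy.columns))    # ['order_id', 'product_id', 'description']
```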

### Challenge: Stack two tables vertically.
**Theory:** `pd.concat` stacks rows and aligns columns by name, so tables with matching schemas combine cleanly.

In [None]:
menu_a = pd.DataFrame({'item':['Cake','Pie']})
menu_b = pd.DataFrame({'item':['Donut']})
menu = pd.concat([menu_a, menu_b], ignore_index=True)  # renumber rows instead of repeating 0, 1, 0
menu
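
If the schemas do not match, `pd.concat` still works: it takes the union of the columns and fills the gaps with `NaN`. In this sketch `menu_c` and its `price` column are invented for the illustration.

```python
import pandas as pd

menu_a = pd.DataFrame({'item': ['Cake', 'Pie']})
menu_c = pd.DataFrame({'item': ['Donut'], 'price': [2.5]})  # extra column

menu = pd.concat([menu_a, menu_c], ignore_index=True)

print(list(menu.index))               # [0, 1, 2]
print(menu['price'].isna().tolist())  # [True, True, False]
```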

## Reusable Helpers
Reusable helper functions streamline repetitive analysis tasks.

In [None]:
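def value_props(series):
    """Return the count and proportion of each distinct value in a Series."""
    counts = series.value_counts()  # NaN is excluded by default
    props = counts / len(series)    # denominator still includes NaN rows, if any
    return pd.DataFrame({'count': counts, 'proportion': props})

def quick_corr(df, col1, col2):
    """Return the Pearson correlation between two numeric columns."""
    return df[[col1, col2]].corr().iloc[0,1]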
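
A quick usage sketch for the two helpers follows. The functions are repeated so the block runs on its own, and the sample `status` and `data` values are made up.

```python
import pandas as pd

def value_props(series):
    counts = series.value_counts()
    props = counts / len(series)
    return pd.DataFrame({'count': counts, 'proportion': props})

def quick_corr(df, col1, col2):
    return df[[col1, col2]].corr().iloc[0, 1]

status = pd.Series(['Shipped', 'Shipped', 'Pending', 'Shipped'])
summary = value_props(status)
print(summary)  # Shipped: count 3, proportion 0.75; Pending: count 1, proportion 0.25

data = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})  # y is exactly 2 * x
r = quick_corr(data, 'x', 'y')
print(round(r, 6))  # 1.0
```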


### Challenge: Convert data types explicitly.
**Theory:** `.astype` changes a column's dtype, helping treat numbers or categories properly.

In [None]:
combined['discount'] = combined['discount'].astype(float)
combined.dtypes.head()
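
`.astype(float)` expects every value to be convertible. For messier input, `pd.to_numeric` with `errors='coerce'` is one option that turns unparseable entries into `NaN`. The sample strings below are hypothetical.

```python
import pandas as pd

clean = pd.Series(['10', '2.5']).astype(float)
print(clean.tolist())  # [10.0, 2.5]

raw = pd.Series(['10', '2.5', 'n/a'])  # 'n/a' cannot be parsed as a number

try:
    raw.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

coerced = pd.to_numeric(raw, errors='coerce')  # bad entries become NaN
print(astype_failed)            # True
print(coerced.isna().tolist())  # [False, False, True]
```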

### Challenge: Impose an order on string categories.
**Theory:** With `ordered=True`, `pd.Categorical` records the declared order of its labels, so sorting and comparisons follow that order rather than the alphabet.

In [None]:
sizes = pd.Categorical(['M','S','L','M'], categories=['S','M','L'], ordered=True)
sizes
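
The payoff of the declared order appears when sorting or comparing. A short sketch that wraps the same categorical in a Series (`as_series` is an illustrative name):

```python
import pandas as pd

sizes = pd.Categorical(['M', 'S', 'L', 'M'], categories=['S', 'M', 'L'], ordered=True)
as_series = pd.Series(sizes)

print(as_series.sort_values().tolist())  # ['S', 'M', 'M', 'L'] (declared order)
print(sorted(['M', 'S', 'L', 'M']))      # ['L', 'M', 'M', 'S'] (alphabetical)
print((as_series > 'S').tolist())        # [True, False, True, True]
print(sizes.max())                       # L
```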

### Challenge: Prepare categorical data for modeling.
**Theory:** `pd.get_dummies` expands categories into binary indicator columns.

In [None]:
pd.get_dummies(products['supplier']).head()
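
Here is what the indicator columns look like on a tiny made-up `supplier` column. Passing `dtype=int` is a choice made for this sketch so the output is 0/1 regardless of the pandas version's default.

```python
import pandas as pd

suppliers = pd.Series(['Acme', 'Best', 'Acme'], name='supplier')  # hypothetical values

dummies = pd.get_dummies(suppliers, dtype=int)  # one 0/1 column per distinct value

print(list(dummies.columns))    # ['Acme', 'Best']
print(dummies.values.tolist())  # [[1, 0], [0, 1], [1, 0]]
```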

### Challenge: Standardize values in a column.
**Theory:** `.replace` swaps old entries with new ones.

In [None]:
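# Assign the result back: calling replace(..., inplace=True) on a selected column may not update `combined`.
combined['status'] = combined['status'].replace({'Pending':'P','Shipped':'S','Delivered':'D'})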
combined[['status']].head()
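
A related point is what happens to values missing from the mapping. In this sketch `'Returned'` is a made-up status with no code: `.replace` leaves it alone, while `.map` turns it into `NaN`.

```python
import pandas as pd

status = pd.Series(['Pending', 'Shipped', 'Returned'])
codes = {'Pending': 'P', 'Shipped': 'S'}

replaced = status.replace(codes)  # values without a mapping are left as they are
mapped = status.map(codes)        # values without a mapping become NaN

print(replaced.tolist())       # ['P', 'S', 'Returned']
print(mapped.isna().tolist())  # [False, False, True]
```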

### Challenge: Summarize distributions with statistics and plots.
**Theory:** Measures like mean and std plus Matplotlib visuals reveal central tendency and spread.

In [None]:
import matplotlib.pyplot as plt
print(orders['quantity'].mean(), orders['quantity'].std())  # print explicitly: a cell only echoes its last expression
orders['quantity'].hist()
plt.show()

### Challenge: View category frequencies and proportions.
**Theory:** `value_counts()` returns raw frequencies; `value_counts(normalize=True)` returns proportions between 0 and 1 (multiply by 100 for percentages).

In [None]:
orders['status'].value_counts(normalize=True)

### Challenge: Compare a numeric variable across categories.
**Theory:** Boxplots visualize distribution differences across groups.

In [None]:
orders.boxplot(column='quantity', by='status')
plt.show()

### Challenge: Overlay histograms for multiple groups.
**Theory:** Density normalization allows comparisons despite sample size differences.

In [None]:
for label, grp in orders.groupby('status'):
    grp['quantity'].plot(kind='hist', alpha=0.5, density=True, label=label)
plt.legend()
plt.show()
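
The effect of density normalization can be checked numerically with `np.histogram`, without plotting. In this sketch the two samples are made up: `large` repeats `small` 25 times, so the shapes match while the sizes differ.

```python
import numpy as np

small = np.array([1, 2, 2, 3])
large = np.array([1, 2, 2, 3] * 25)  # same shape of distribution, 100 values
bins = [0.5, 1.5, 2.5, 3.5]          # three bins of width 1

small_counts, _ = np.histogram(small, bins=bins)
large_counts, _ = np.histogram(large, bins=bins)
small_density, _ = np.histogram(small, bins=bins, density=True)
large_density, _ = np.histogram(large, bins=bins, density=True)

print(small_counts, large_counts)                 # [1 2 1] [25 50 25]
print(small_density)                              # [0.25 0.5  0.25]
print(np.allclose(small_density, large_density))  # True
```

Raw counts differ by a factor of 25, but the densities are identical, which is why the overlaid histograms become comparable.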

### Challenge: Visualize distributions for many categories.
**Theory:** Multi-group boxplots show spread for each subgroup.

In [None]:
orders['weekday']=pd.to_datetime(orders['order_date']).dt.day_name()
orders.boxplot(column='quantity', by='weekday', rot=45)
plt.show()

### Challenge: Measure association between two numeric variables.
**Theory:** Scatterplots reveal relationships; covariance gives the direction of a linear relationship, and Pearson correlation rescales it to the range -1 to 1 to quantify its strength.

In [None]:
plt.scatter(products['price'], products['stock'])
plt.xlabel('price'); plt.ylabel('stock'); plt.show()
print(products[['price','stock']].cov())  # print explicitly: a cell only echoes its last expression
products[['price','stock']].corr(method='pearson')
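
Covariance and Pearson correlation are linked: the correlation is the covariance divided by the product of the two standard deviations. The sketch below verifies this on four made-up products.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and stock levels.
products = pd.DataFrame({'price': [10.0, 20.0, 30.0, 40.0], 'stock': [50, 45, 20, 10]})

cov = products['price'].cov(products['stock'])
corr = products['price'].corr(products['stock'])

# Pearson r = cov(x, y) / (std(x) * std(y)); pandas uses the sample (n - 1) versions.
manual = cov / (products['price'].std() * products['stock'].std())

print(np.isclose(corr, manual))  # True
print(corr < 0)                  # True: stock falls as price rises in this sample
```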

### Challenge: Test independence between categorical variables.
**Theory:** Contingency tables and the Chi-square test assess association.

In [None]:
import scipy.stats as stats
ct = pd.crosstab(combined['status'], combined['city'])
chi2, p, dof, expected = stats.chi2_contingency(ct)
chi2, p
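
To show what the test statistic measures, this sketch computes chi-square by hand from a contingency table using only pandas and NumPy. The 60 rows of data are invented. Note that for a 2x2 table `stats.chi2_contingency` applies Yates' continuity correction by default, so its statistic would differ from this uncorrected value unless `correction=False` is passed.

```python
import numpy as np
import pandas as pd

# Made-up data: 30 shipped and 30 pending orders across two cities.
data = pd.DataFrame({
    'status': ['Shipped'] * 30 + ['Pending'] * 30,
    'city': ['Austin'] * 20 + ['Boston'] * 10 + ['Austin'] * 10 + ['Boston'] * 20,
})

ct = pd.crosstab(data['status'], data['city'])
observed = ct.to_numpy()

# Expected count under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected.tolist())    # [[15.0, 15.0], [15.0, 15.0]]
print(round(chi2, 2), dof)  # 6.67 1
```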