<a href="https://colab.research.google.com/github/boonecabaldev/pandas_exercises/blob/main/Pandas_Exercise_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Exercises #4

Here is another `pandas` exercise set. You're doing great.

## Problem 1: Financial Data Analysis

**File:** `stock_transactions.csv`

```
TransactionID,Symbol,Date,Type,Shares,Price
T12345,AAPL,2023-11-20,Buy,100,150.50
T23456,GOOG,2023-11-21,Sell,50,1250.25
T34567,AAPL,2023-11-22,Buy,75,148.75
T45678,MSFT,2023-11-23,Buy,200,285.30
T56789,GOOG,2023-11-24,Buy,30,1265.10
T67890,AAPL,2023-11-25,Sell,150,152.25
```

**Tasks:**

1. Read the CSV into a DataFrame, parsing `Date` as datetime.
2. Calculate the total cost of buying and the total revenue from selling for each stock symbol.
3. Determine the net profit or loss for each stock symbol.
4. Find the stock with the highest total transaction volume (shares bought + shares sold).

**My Solution**

In [None]:
# Deferred to Original Solution

**Original Solution:**

In [None]:
import pandas as pd

# Read CSV with date parsing
df_transactions = pd.read_csv('sample_data/stock_transactions.csv', parse_dates=['Date'])

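# Calculate cost and revenue per symbol.
# The boolean mask (Type == 'Buy' / 'Sell') acts as 1 or 0 when multiplied,
# so Cost is nonzero only on Buy rows and Revenue only on Sell rows.
df_transactions['Cost'] = df_transactions['Shares'] * df_transactions['Price'] * (df_transactions['Type'] == 'Buy')
df_transactions['Revenue'] = df_transactions['Shares'] * df_transactions['Price'] * (df_transactions['Type'] == 'Sell')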

cost_per_symbol = df_transactions.groupby('Symbol')['Cost'].sum()
revenue_per_symbol = df_transactions.groupby('Symbol')['Revenue'].sum()

# Net profit/loss
net_profit_loss = revenue_per_symbol - cost_per_symbol
print("\nNet profit/loss per symbol:")
print(net_profit_loss.to_markdown(numalign="left", stralign="left"))

# Highest transaction volume
transaction_volume = df_transactions.groupby('Symbol')['Shares'].sum()
highest_volume_symbol = transaction_volume.idxmax()
print(f"\nSymbol with highest transaction volume: {highest_volume_symbol}")


TransactionID Symbol       Date Type  Shares   Price     Cost
       T12345   AAPL 2023-11-20  Buy     100  150.50 15050.00
       T23456   GOOG 2023-11-21 Sell      50 1250.25     0.00
       T34567   AAPL 2023-11-22  Buy      75  148.75 11156.25
       T45678   MSFT 2023-11-23  Buy     200  285.30 57060.00
       T56789   GOOG 2023-11-24  Buy      30 1265.10 37953.00
       T67890   AAPL 2023-11-25 Sell     150  152.25     0.00

Net profit/loss per symbol:
| Symbol   | 0        |
|:---------|:---------|
| AAPL     | -3368.75 |
| GOOG     | 24559.5  |
| MSFT     | -57060   |

Symbol with highest transaction volume: AAPL
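
As an alternative to the two masked columns, the same cost, revenue, and net figures can be produced in a single table with `pivot_table`. This is only a sketch: the CSV rows are inlined so it runs without the file, and the names `Amount`, `summary`, and `Net` are illustrative choices rather than part of the original solution.

```python
import io
import pandas as pd

# Same six rows as stock_transactions.csv, inlined so the sketch runs on its own.
csv_text = """TransactionID,Symbol,Date,Type,Shares,Price
T12345,AAPL,2023-11-20,Buy,100,150.50
T23456,GOOG,2023-11-21,Sell,50,1250.25
T34567,AAPL,2023-11-22,Buy,75,148.75
T45678,MSFT,2023-11-23,Buy,200,285.30
T56789,GOOG,2023-11-24,Buy,30,1265.10
T67890,AAPL,2023-11-25,Sell,150,152.25
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Dollar value of each transaction, regardless of direction.
df["Amount"] = df["Shares"] * df["Price"]

# One row per symbol, one column per transaction type.
# Symbols with no Sell (or no Buy) rows get 0 instead of NaN.
summary = df.pivot_table(
    index="Symbol", columns="Type", values="Amount", aggfunc="sum", fill_value=0
)
summary = summary.rename(columns={"Buy": "Cost", "Sell": "Revenue"})
summary["Net"] = summary["Revenue"] - summary["Cost"]

print(summary)
```

The `fill_value=0` argument matters for MSFT, which has no `Sell` rows; without it the `Revenue` cell would be `NaN` and so would the net figure.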


## Problem 2: Customer Segmentation

**File:** `customer_data.csv`

```
CustomerID,Age,Gender,Income,SpendingScore
1,19,Male,15000,39
2,21,Male,15000,81
3,20,Female,16000,6
4,23,Female,16000,77
5,31,Female,17000,40
```

**Tasks:**

1. Read the CSV into a DataFrame.
2. Use k-means clustering to segment customers into 3 groups based on `Age`, `Income`, and `SpendingScore`.
3. Add a `Segment` column to the DataFrame indicating the assigned cluster for each customer (e.g., 'Segment 0', 'Segment 1', 'Segment 2').

**My Solution**

In [None]:
# Defer to Original Solution

# This is beyond the pandas stuff I'm doing at this juncture

**Original Solution:**

In [None]:
import pandas as pd
from sklearn.cluster import KMeans

# Read CSV
df_customers = pd.read_csv('sample_data/customer_data.csv')

# Clustering
X = df_customers[['Age', 'Income', 'SpendingScore']]
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Add segment labels
df_customers['Segment'] = ['Segment ' + str(label) for label in kmeans.labels_]

print("\nCustomer data with segments:")
print(df_customers.head().to_markdown(index=False, numalign="left", stralign="left"))


Customer data with segments:
| CustomerID   | Age   | Gender   | Income   | SpendingScore   | Segment   |
|:-------------|:------|:---------|:---------|:----------------|:----------|
| 1            | 19    | Male     | 15000    | 39              | Segment 1 |
| 2            | 21    | Male     | 15000    | 81              | Segment 1 |
| 3            | 20    | Female   | 16000    | 6               | Segment 2 |
| 4            | 23    | Female   | 16000    | 77              | Segment 2 |
| 5            | 31    | Female   | 17000    | 40              | Segment 0 |
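
A side note on the clustering, not a correction: k-means groups points by Euclidean distance, so a feature with a much larger numeric range (here `Income`) tends to dominate the result. The pandas-only sketch below compares the raw spread of each feature with z-score standardized values. The data is inlined from the sample file, `raw_range` and `X_scaled` are hypothetical names, and standardizing is one common option that the original solution does not use.

```python
import pandas as pd

# The five sample rows from customer_data.csv, inlined for a self-contained run.
df_customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Age": [19, 21, 20, 23, 31],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Income": [15000, 15000, 16000, 16000, 17000],
    "SpendingScore": [39, 81, 6, 77, 40],
})

X = df_customers[["Age", "Income", "SpendingScore"]]

# Range (max - min) of each raw feature: Income spans thousands, the others tens.
raw_range = X.max() - X.min()
print(raw_range)

# Z-score standardization: subtract the column mean, divide by the column
# standard deviation, so every feature ends up on a comparable scale.
X_scaled = (X - X.mean()) / X.std()
print(X_scaled.round(2))
```

The standardized frame could be passed to `KMeans(...).fit(X_scaled)` in place of `X`. With only five rows, any resulting segments are illustrative at best.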


## Problem 3: Text Data Cleaning


**File:** `product_reviews.txt` (One review per line)

```
This product is amazing! 5 stars
I love it!
It's okay, could be better.
Not worth the price.
The best purchase I've made this year.
```

**Tasks:**

1. Read the text file into a Series.
2. Convert all reviews to lowercase.
3. Remove punctuation from the reviews.
4. Calculate the frequency of each word in all reviews.

**My Solution**

In [None]:
# Defer to Original Solution

# Will be studying this

**Original Solution:**

In [None]:
import pandas as pd

# Read text file into Series
with open('sample_data/product_reviews.txt', 'r') as file:
    reviews = pd.Series(file.read().splitlines())

# Lowercase and remove punctuation (raw string so \w and \s are not treated as string escapes)
reviews = reviews.str.lower().str.replace(r'[^\w\s]', '', regex=True)

# Word frequency
all_words = ' '.join(reviews).split()
word_freq = pd.Series(all_words).value_counts()

print("\nWord frequency:")
print(word_freq.head(10).to_markdown(numalign="left", stralign="left"))


Word frequency:
|          | count   |
|:---------|:--------|
| this     | 2       |
| the      | 2       |
| be       | 1       |
| made     | 1       |
| ive      | 1       |
| purchase | 1       |
| best     | 1       |
| price    | 1       |
| worth    | 1       |
| not      | 1       |
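
The word count can also stay inside pandas instead of joining everything into one Python string. The sketch below is one possible variant, with the five reviews inlined so it runs without the text file; the names `cleaned` and `words` are illustrative.

```python
import pandas as pd

# The five sample reviews from product_reviews.txt, inlined for a self-contained run.
reviews = pd.Series([
    "This product is amazing! 5 stars",
    "I love it!",
    "It's okay, could be better.",
    "Not worth the price.",
    "The best purchase I've made this year.",
])

# Lowercase, then drop every character that is neither a word character nor whitespace.
cleaned = reviews.str.lower().str.replace(r"[^\w\s]", "", regex=True)

# str.split() turns each review into a list of words;
# explode() then produces one row per word, keeping the review's index.
words = cleaned.str.split().explode()
word_freq = words.value_counts()

print(word_freq.head(10))
```

Because `explode()` keeps the original index, `words` still records which review each word came from, which can help if per-review counts are needed later. The order of words tied at the same count may differ from the output shown above.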

