In [4]:
import pandas as pd
import numpy as np

The `nunique()` method in pandas counts the distinct values in a Series (such as a DataFrame column) or along an axis of a DataFrame.

Key features and usage:
- **Counting Unique Values in a Series/Column**: When applied to a single DataFrame column, `nunique()` returns one integer: the number of distinct values in that column.

In [2]:
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

unique_names_count = df['Name'].nunique()
print(f"Number of unique names: {unique_names_count}")

Number of unique names: 3


- **Counting Unique Values in a DataFrame**: When applied directly to a DataFrame, `df.nunique()` returns a Series where the index consists of the column names and the values are the counts of unique entries in each respective column.

In [3]:
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

unique_counts_per_column = df.nunique()
print("Unique counts per column:\n", unique_counts_per_column)

Unique counts per column:
 Name    3
Age     3
dtype: int64


- **Handling NaN Values**: By default, `nunique()` ignores `NaN` (Not a Number) values when counting. This behavior can be controlled using the `dropna` parameter. Setting `dropna=False` will include `NaN` as a unique value if present.

- **Specifying Axis**: For DataFrames, the axis parameter can be used to specify whether to count unique values row-wise (`axis=1` or `'columns'`) or column-wise (`axis=0` or `'index'`). By default, it operates column-wise.
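
The `dropna` and `axis` options can be sketched with a small frame that contains missing values. The frame `df_nan` and its contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with one missing value in each column
df_nan = pd.DataFrame({"A": [1, 1, np.nan], "B": [4, np.nan, 4]})

# Default: NaN is ignored and counts are taken per column (axis=0)
per_column = df_nan.nunique()

# dropna=False counts NaN as its own distinct value
per_column_with_nan = df_nan.nunique(dropna=False)

# axis=1 counts distinct values within each row instead
per_row = df_nan.nunique(axis=1)

print(per_column.tolist())           # [1, 1]
print(per_column_with_nan.tolist())  # [2, 2]
print(per_row.tolist())              # [2, 1, 1]
```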

In [None]:
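from sklearn.impute import SimpleImputer

# Training data with one missing value in column 0
nums = [[1, 2], [np.nan, 3], [7, 6]]
newdf = pd.DataFrame(nums)

# Replace NaN with the mean of the corresponding column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Learn the per-column means from the training data
imp.fit(newdf)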


SimpleImputer()


In [9]:

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(pd.DataFrame(X))
print(imp.transform(X))

     0    1
0  NaN  2.0
1  6.0  NaN
2  7.0  6.0
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]


In [10]:
imp.statistics_


array([4.        , 3.66666667])
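
As a cross-check, the values in `statistics_` are the per-column means of the data passed to `fit`, with `NaN` entries skipped. A minimal pandas sketch of the same arithmetic follows; the variable names are illustrative, and it mirrors only the `strategy='mean'` case.

```python
import numpy as np
import pandas as pd

# Same training data that was passed to imp.fit
train = pd.DataFrame([[1, 2], [np.nan, 3], [7, 6]])

# Column means; pandas skips NaN by default, so column 0 is (1 + 7) / 2
col_means = train.mean()

# Fill the gaps in new data with the training means, matched by column label
new = pd.DataFrame([[np.nan, 2], [6, np.nan], [7, 6]])
filled = new.fillna(col_means)

print(col_means.to_numpy())
print(filled.to_numpy())
```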

In [12]:
df = pd.DataFrame(
    [
        [24.3, 75.7, "high"],
        [31, 87.8, "high"],
        [22, 71.6, "medium"],
        [35, 95, "medium"],
    ],
    columns=["temp_celsius", "temp_fahrenheit", "windspeed"],
    index=pd.date_range(start="2014-02-12", end="2014-02-15", freq="D"),
)
df

            temp_celsius  temp_fahrenheit windspeed
2014-02-12          24.3             75.7      high
2014-02-13          31.0             87.8      high
2014-02-14          22.0             71.6    medium
2014-02-15          35.0             95.0    medium


In [20]:
# Keep only the rows whose index label contains "2014-02-13"
df.filter(like="2014-02-13", axis=0)

            temp_celsius  temp_fahrenheit windspeed
2014-02-13          31.0             87.8      high


### Standardization


In [21]:
from sklearn.preprocessing import StandardScaler

# Sample data with different scales
X = np.array([[1, 1000], 
              [2, 2000], 
              [3, 3000], 
              [4, 4000], 
              [5, 5000]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:")
print(f"Means: {X.mean(axis=0)}")
print(f"Std devs: {X.std(axis=0)}")

print("\nStandardized data:")
print(f"Means: {X_scaled.mean(axis=0)}")  # ≈ [0, 0]
print(f"Std devs: {X_scaled.std(axis=0)}")  # ≈ [1, 1]

Original data:
Means: [   3. 3000.]
Std devs: [   1.41421356 1414.21356237]

Standardized data:
Means: [0. 0.]
Std devs: [1. 1.]


In [22]:
X_scaled

array([[-1.41421356, -1.41421356],
       [-0.70710678, -0.70710678],
       [ 0.        ,  0.        ],
       [ 0.70710678,  0.70710678],
       [ 1.41421356,  1.41421356]])
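
The scaled values above follow from the usual z-score formula, $z = (x - \mu) / \sigma$, applied per column with the population standard deviation (NumPy's default `ddof=0`). The NumPy-only sketch below is a hand check, not a replacement for `StandardScaler`; the names `mu`, `sigma`, and `z` are illustrative.

```python
import numpy as np

X = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000], [5, 5000]])

mu = X.mean(axis=0)    # per-column mean
sigma = X.std(axis=0)  # per-column population standard deviation (ddof=0)

# Subtract the mean and divide by the standard deviation, column by column
z = (X - mu) / sigma
print(z)
```

Both columns produce identical z-scores because they differ only by a constant factor of 1000, which the division by $\sigma$ removes.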

### MinMax Scaling
**Custom Range:**
You can scale to any range, not just $[0, 1]$:
```
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled_custom = scaler.fit_transform(X)
```
**Key Characteristics**
- Advantages:
    * Preserves original distribution: Doesn't change the shape of the data

    * Intuitive: Easy to understand and interpret

    * Bounded range: All values fall within known boundaries

    * Good for algorithms that expect bounded, non-negative inputs: With the default $[0, 1]$ range, this suits, for example, neural networks with certain activation functions

- Disadvantages:
    * Sensitive to outliers: A single extreme value stretches the min-max range and compresses all other data points into a narrow band
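
The outlier sensitivity can be seen in a small sketch. The data is made up: four values close together plus one extreme value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature: four typical values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = MinMaxScaler().fit_transform(x)

# Min-max formula: (x - min) / (max - min), here (x - 1) / 99
print(scaled.ravel())
```

The four typical values all land below about 0.03, while the outlier alone sits at 1.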

In [23]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
X = np.array([[1,  100],
              [2,  500],
              [3,  300],
              [4,  800],
              [5,  200]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:")
print(X)
print("\nMinMax scaled data (0-1 range):")
print(X_scaled)

Original data:
[[  1 100]
 [  2 500]
 [  3 300]
 [  4 800]
 [  5 200]]

MinMax scaled data (0-1 range):
[[0.         0.        ]
 [0.25       0.57142857]
 [0.5        0.28571429]
 [0.75       1.        ]
 [1.         0.14285714]]


## 📖 Notes
- [Numpy Notes](./notes/numpy-notes.md)
- [SQL Snippets](./notes/sql-snippets.md)

## Experiments
> What I practice as I learn new things.
- [Medical Charges](./notebooks/experiments/medical_charges_example.ipynb)

## Machine Learning
> Any models I learn or practice.

