# Python CheatSheet

📚 **Index**

- [**Basic Data Analysis**](#basic-data-analysis)
- [**Selection**](#selection)
- [**DateTimes Series in Pandas**](#datetimes-series-in-pandas)
- [**Plotting**](#plotting)
- [**Time series in Pandas**](#time-series-in-pandas)
- [**Time series forecasting with Linear Regression**](#time-series-forecasting-with-linear-regression)
    - [Linear Regression with Statsmodels](#linear-regression-with-statsmodels)

In [8]:
# Sample dataset used by the examples below
import pandas as pd
import numpy as np

# Reusable sample dataset for this cheatsheet (deterministic)
rng = np.random.default_rng(42)
# 'M' means month-end; pandas 2.2+ emits a FutureWarning and prefers 'ME'
dates = pd.date_range('2024-01-01', periods=12, freq='M')
cities = ['Amsterdam', 'Rotterdam', 'Utrecht']
products = ['Widget', 'Gadget', 'Doohickey']

records = []
_id = 1000
for d in dates:
    for c in cities:
        for p in products:
            units = int(rng.integers(1, 20))
            price = float(np.round(rng.uniform(5, 50), 2))
            records.append({
                'id': _id,
                'date': d,
                'city': c,
                'product': p,
                'units': units,
                'price': price,
                'revenue': float(np.round(units * price, 2))
            })
            _id += 1

df = pd.DataFrame(records)
# Quick peek so you can verify it's there
df.head()


|   | id | date | city | product | units | price | revenue |
|---|---|---|---|---|---|---|---|
| 0 | 1000 | 2024-01-31 | Amsterdam | Widget | 2 | 24.75 | 49.5 |
| 1 | 1001 | 2024-01-31 | Amsterdam | Gadget | 15 | 43.64 | 654.6 |
| 2 | 1002 | 2024-01-31 | Amsterdam | Doohickey | 2 | 9.24 | 18.48 |
| 3 | 1003 | 2024-01-31 | Rotterdam | Widget | 14 | 48.9 | 684.6 |
| 4 | 1004 | 2024-01-31 | Rotterdam | Gadget | 14 | 40.37 | 565.18 |


In [9]:
# Creating indexes (without modifying the original `df`)
# 1) Single-column index by unique id
df_by_id = df.set_index('id').sort_index()

# 2) Time index (useful for resampling, rolling windows, etc.)
df_by_date = df.set_index('date').sort_index()

# 3) MultiIndex by city and product
df_city_product = df.set_index(['city', 'product']).sort_index()

# If you ever want to bring the index back as columns, use reset_index():
# df_by_id = df_by_id.reset_index()

## **Basic Data Analysis**

- `.head()` returns the first few rows of the dataframe.

- `.info()` gives you basic information about the dataframe.

- `.shape` gives you the shape (rows, columns) of the dataframe. It is an attribute, so it takes no parentheses.

- `.columns` gives you all the column names.

- Descriptive analysis: `.mean()`, `.median()`, `.mode()`

- Counting data by category: `.value_counts()`

- Grouping the data: `df.groupby('column')['value_column'].sum()` (select the numeric column first, so text and date columns are not aggregated)
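
The methods above can be tried on a tiny frame. This is a minimal sketch; the `sales` DataFrame and its values are made up for the illustration and are not the cheatsheet's sample dataset.

```python
import pandas as pd

# Hypothetical mini dataset, used only for this illustration
sales = pd.DataFrame({
    "city": ["Amsterdam", "Amsterdam", "Utrecht", "Rotterdam"],
    "units": [2, 15, 7, 14],
    "price": [10.0, 20.0, 30.0, 40.0],
})

print(sales.head(2))           # first two rows
sales.info()                   # column dtypes and non-null counts
n_rows, n_cols = sales.shape   # attribute, not a method
column_names = list(sales.columns)

mean_units = sales["units"].mean()       # 9.5
median_units = sales["units"].median()   # 10.5
city_counts = sales["city"].value_counts()

# Select the numeric column before aggregating, so text columns are not summed
units_by_city = sales.groupby("city")["units"].sum()
print(units_by_city)
```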



## **Selection**

- Use brackets (`[]`): For quick column selection and simple boolean row filtering.
    - `names = df["Name"]` or `older_than_25 = df[df["Age"] > 25]`

- Use `at`: For fast access to a single scalar value. To access the value in row 2, column "A", you would use:
    - `value = df.at[2, "A"]`


- Use `loc`: For label-based selection and conditional filtering.
    - Example 01: Select rows based on a condition: `df_selection = df.loc[df["Age"] > 30]`
    - Example 02: Select the first 3 rows and 2 specific columns: `df_selection = df.loc[0:2, ["Age", "Country"]]` (with the default integer index; `loc` includes the end label)

- Use `iloc`: For integer-based selection by position.
    - Example: `subset = df.iloc[0:3, 0:2]` gets the first 3 rows and the first 2 columns (the end position is excluded)

- Use `query`: For spreadsheet-like, readable filtering expressions.
    - Example: `subset = df.query("Age > 30")`
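
The selection styles above, side by side. A minimal sketch, assuming a hypothetical `people` DataFrame with the default integer index; the names and values are invented.

```python
import pandas as pd

# Hypothetical data with the default integer index (0, 1, 2, 3)
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Chloe", "Dev"],
    "Age": [22, 35, 41, 28],
    "Country": ["PT", "NL", "FR", "IN"],
})

names = people["Name"]                             # one column -> Series
over_30 = people.loc[people["Age"] > 30]           # boolean mask on rows
first_three = people.loc[0:2, ["Age", "Country"]]  # loc: end label is included
by_position = people.iloc[0:3, 0:2]                # iloc: end position is excluded
single_value = people.at[2, "Age"]                 # scalar lookup by label
queried = people.query("Age > 30")                 # same rows as over_30
```

Note that `loc[0:2]` and `iloc[0:3]` both return three rows here, because label slices include the end and position slices do not.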

## **DateTimes Series in Pandas**

- Date Formatter String options

```
%y - Last two digits of the year
%Y - Full year with century
%m - Month as zero-padded number
%d - Day as a zero-padded number
%b - Abbreviated month name

```
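
The format codes above work for both parsing and formatting. A minimal sketch, assuming hypothetical day-month-year strings; the month name produced by `%b` depends on the system locale.

```python
import pandas as pd

raw = pd.Series(["31-01-24", "29-02-24"])  # hypothetical day-month-year strings

# Parse with an explicit format: %d day, %m month, %y two-digit year
parsed = pd.to_datetime(raw, format="%d-%m-%y")

# Format back to text: %d day, %b abbreviated month name, %Y full year
formatted = parsed.dt.strftime("%d %b %Y")
print(formatted.tolist())
```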

- Convert Strings to DateTime

In [10]:
df["date"] = pd.to_datetime(df["date"])

- Use `DateTime` as index

In [11]:
df_withindex = df.set_index("date")

- Check datatypes

In [12]:
print(df.dtypes)

id                  int64
date       datetime64[ns]
city               object
product            object
units               int64
price             float64
revenue           float64
dtype: object


- Accessing Date components

In [14]:
# For a single date
year = df["date"][0].year
month = df["date"][0].month
day = df["date"][0].day
weekday = df["date"][0].weekday()

print(f"day: {day}, month: {month}, year: {year}, weekday: {weekday}")

# For a Series of dates
days = df["date"].dt.day


day: 31, month: 1, year: 2024, weekday: 2


- Using `inplace` argument to modify a df directly

In [None]:
# Without inplace: set_index() returns a new DataFrame, so store the result
df_indexed = df.set_index("date")
# Reassigning to the same name also works:
# df = df.set_index("date")

# With inplace=True: the DataFrame stored in df is modified directly
# and the method returns None. Use one approach or the other; once "date"
# is the index, calling set_index("date") again raises a KeyError.
df.set_index("date", inplace=True)

- Selecting data by a date

In [None]:
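# Requires a DatetimeIndex (set the date column as the index first)
data_on_date = df.loc["2020-02-03"]

# Select a date range (both endpoints are included; 2020 is a leap year)
february_2020 = df.loc["2020-02-01":"2020-02-29"]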
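
A minimal sketch of date-based selection, assuming a hypothetical daily `prices` frame for January to March 2020. It also shows partial-string indexing (`"2020-02"`), which selects a whole month without spelling out the last day.

```python
import pandas as pd

# Hypothetical daily series covering January to March 2020
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
prices = pd.DataFrame({"adjusted_close": range(len(idx))}, index=idx)

one_day = prices.loc["2020-02-03"]                  # a single date
february = prices.loc["2020-02"]                    # partial string: whole month
date_range = prices.loc["2020-02-10":"2020-02-14"]  # both endpoints included

print(len(february), len(date_range))  # 29 5
```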

## **Plotting**

- Plotting Time Series Data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.lineplot(data=df, x=df.index, y="adjusted_close")
plt.show()

- Plotting multiple series

In [None]:
# Select each month with .loc; in current pandas, df["2020-01"] looks up a column, not rows
sns.lineplot(data=df.loc["2020-01"], x=df.loc["2020-01"].index.day, y="adjusted_close", label="January")
sns.lineplot(data=df.loc["2020-02"], x=df.loc["2020-02"].index.day, y="adjusted_close", label="February", color="red", linestyle="--")
sns.lineplot(data=df.loc["2020-03"], x=df.loc["2020-03"].index.day, y="adjusted_close", label="March", color="darkgray")
plt.legend()
plt.show()

- Formatting Date Axis Labels

In [None]:
# Import Matplotlib's date locators and formatters
import matplotlib.dates as mdates
ax = sns.lineplot(data=df, x=df.index, y="adjusted_close", color="#FF9900")
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%y"))
plt.show()

- Highlight reference lines

In [None]:
# Add a horizontal reference line (placeholder values; use your own y, color, and linestyle)
plt.axhline(y=100, color="gray", linestyle="--")

## **Time series in Pandas**

- **Calculate Moving Average**: Use `.rolling()` method to calculate a moving average on a Pandas series.



In [None]:
df["rolling_avg_column"] = df["column_name"].rolling(window=10).mean()

# Plotting with seaborn (pass the Series as data= so it is not read as x)
sns.lineplot(data=df["rolling_avg_column"])
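
A minimal sketch of what `.rolling()` computes, using a hypothetical five-value Series. The first `window - 1` results are `NaN` unless `min_periods` is lowered.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Each result is the mean of the current value and the two before it
rolling_avg = values.rolling(window=3).mean()
print(rolling_avg.tolist())  # [nan, nan, 2.0, 3.0, 4.0]

# min_periods=1 averages whatever is available instead of returning NaN
rolling_partial = values.rolling(window=3, min_periods=1).mean()
print(rolling_partial.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```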

- **Percent Change**: Use `pct_change()`

In [None]:
df["pct_change_column"] = df["column_name"].pct_change()

# Filter rows with a significant drop (below -0.05, i.e. more than 5%) into a new DataFrame
significant_changes_df = df[df["pct_change_column"] < -0.05]

# Plotting percent change
sns.lineplot(data=df["pct_change_column"])
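
A minimal sketch of `pct_change()` on a hypothetical `close` Series. Values are fractions, so `-0.05` means a 5% drop.

```python
import pandas as pd

close = pd.Series([100.0, 110.0, 99.0, 99.0])

# (current - previous) / previous; the first row has no previous value
change = close.pct_change()
print(change.round(2).tolist())  # [nan, 0.1, -0.1, 0.0]

# Keep only the drops larger than 5%
big_drops = change[change < -0.05]
print(big_drops.index.tolist())  # [2]
```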

## **Time series forecasting with Linear Regression**

- Convert the date column to the index (resampling requires a `DatetimeIndex`)


In [None]:
df.set_index("datetime_column", inplace=True)

- Resample data

In [None]:
# Resample to weekly frequency
df_weekly = df.resample("W").first()  # or .mean() / .sum(), as needed

- Add numerical index

In [None]:
# create a numeric index
df_weekly["idx"] = range(len(df_weekly))
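
The resample and numeric-index steps can be sketched together. This assumes a hypothetical `daily` frame of 14 readings starting on a Monday, and uses `.mean()` as the aggregation.

```python
import pandas as pd

# Hypothetical daily readings: 14 days starting on Monday 2024-01-01
idx = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.DataFrame({"temperature": range(14)}, index=idx)

# "W" groups into weeks ending on Sunday; each bin is labelled by its last day
weekly = daily.resample("W").mean()

# Numeric time index (0, 1, 2, ...) to use as a regression predictor
weekly["idx"] = range(len(weekly))
print(weekly)
```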

## **Linear Regression with Statsmodels**

In [None]:
import statsmodels.api as sm

# Define predictors
predictors = ["idx"]
X = sm.add_constant(df_weekly[predictors])  # Add constant for intercept

# Define target variable (e.g., temperature)
y = df_weekly["temperature_column"]

# Create the model
model = sm.OLS(y, X)
# Train the model; predictions and statistics live on the fitted results object
results = model.fit()

# Print model summary
print(results.summary())


- Add categorical features

In [None]:
predictors = ["idx", "season"]

# Encode season as dummy variables
dummies = pd.get_dummies(df_weekly[predictors], columns=["season"], drop_first=True, dtype=int)

# Add constant for intercept
X = sm.add_constant(dummies)  
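
A minimal sketch of what the dummy encoding produces, assuming a hypothetical `weekly` frame with one row per season. With `drop_first=True`, the dropped category becomes the baseline that the intercept represents.

```python
import pandas as pd

# Hypothetical weekly frame with a numeric index and a categorical season
weekly = pd.DataFrame({
    "idx": [0, 1, 2, 3],
    "season": ["winter", "spring", "summer", "autumn"],
})

# drop_first=True drops the alphabetically first category ("autumn"),
# which becomes the baseline absorbed by the intercept
dummies = pd.get_dummies(weekly, columns=["season"], drop_first=True, dtype=int)
print(dummies.columns.tolist())
# ['idx', 'season_spring', 'season_summer', 'season_winter']
```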

- Line of best fit: Visualize the regression line on a scatter plot.

In [None]:
plt.scatter(df["carat"], df["price"])
plt.plot(df["carat"], results.predict(X), color="red")  # predict with the fitted results, not the unfitted model
plt.xlabel("Carat")
plt.ylabel("Price")
plt.title("Carat vs. Price with Regression Line")
plt.show()

- Model evaluation: Calculate residuals and MAE

In [None]:
# Calculate residuals
y_pred = results.predict(X)  # use the fitted results object
residuals = y - y_pred

# Calculate mean absolute error
MAE = residuals.abs().mean()
print("MAE:", MAE)
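
A minimal sketch of the residual and MAE arithmetic, using hypothetical actual and predicted values rather than a fitted model.

```python
import pandas as pd

# Hypothetical actual values and model predictions
y = pd.Series([10.0, 12.0, 15.0, 11.0])
y_pred = pd.Series([11.0, 12.0, 13.0, 12.0])

residuals = y - y_pred          # positive: the model under-predicted
mae = residuals.abs().mean()    # average size of the miss, in y's units
print(mae)  # 1.0
```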

- Model Diagnostics: Use scatter plots of predictions vs. actuals to evaluate model fit.

In [None]:
predictions = results.predict(X)  # use the fitted results object
plt.scatter(predictions, y)
plt.xlabel("Predicted Price")
plt.ylabel("Actual Price")
plt.title("Predicted vs. Actual Prices")
plt.show()

- Forecasting with the model

In [None]:
# Predict future value (week 210)
future_week = [1, 210, 0, 0, 1]  # 1 for constant, 210 for idx, dummy variables for seasons
future_temp = results.predict([future_week])  # values must follow the column order of X
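
To see why the forecast row starts with `1` and must match the column order of `X`, here is a minimal sketch that uses NumPy's least-squares solver as a stand-in for the statsmodels fit. The data is hypothetical and noise-free, and there are no season dummies in this simplified version.

```python
import numpy as np

# Hypothetical noise-free weekly data: temperature = 5 + 0.5 * idx
idx = np.arange(10)
y = 5 + 0.5 * idx

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(len(idx)), idx])

# Ordinary least squares via NumPy (stand-in for the statsmodels OLS fit)
params, *_ = np.linalg.lstsq(X, y, rcond=None)

# A forecast row lists features in the same order as X's columns
future_row = np.array([1, 210])   # constant, idx
forecast = future_row @ params
print(round(forecast, 2))  # 110.0
```

A prediction is the dot product of the feature row and the fitted coefficients, so a misplaced value in the row silently changes the forecast.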

## Confidence interval for means

In [None]:
from scipy import stats

# Descriptive statistics
n = df["price"].count()  # Sample size
xbar = df["price"].mean()  # Sample mean
s = df["price"].std()  # Sample standard deviation
conf = 0.95  # Confidence level

# Standard error calculation
SEM = s / np.sqrt(n)

# Confidence interval
interval = stats.norm.interval(conf, loc=xbar, scale=SEM)
print("With 95% confidence, the true mean price is between", interval[0], "and", interval[1])
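
The interval above is `xbar ± z * SEM`. A minimal sketch of the same arithmetic done by hand, using the standard library's `NormalDist` for the z-value instead of SciPy; the `prices` sample is hypothetical.

```python
import numpy as np
from statistics import NormalDist

prices = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # hypothetical sample

n = len(prices)
xbar = prices.mean()
s = prices.std(ddof=1)            # sample standard deviation, like pandas' .std()
sem = s / np.sqrt(n)

conf = 0.95
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%

lower, upper = xbar - z * sem, xbar + z * sem
print(lower, upper)
```

For small samples like this one, a t-based interval (`stats.t.interval`) is usually the more appropriate choice and is wider.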

## One sample t-test

In [None]:
# Define hypotheses
alpha = 0.05  # Significance level

# Perform one-sample t-test
test_results = stats.ttest_1samp(df[df["cut"] == "Premium"]["price"], popmean=4500)

p_value = test_results[1]

# Check p-value
if p_value < alpha:
    print("Reject the null hypothesis with p-value", p_value)
else:
    print("Fail to reject the null hypothesis with p-value", p_value)
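
A minimal sketch of the statistic behind the one-sample t-test, computed by hand with NumPy on a hypothetical sample. It covers only the t statistic; the p-value still comes from the t distribution, which is what `stats.ttest_1samp` returns.

```python
import numpy as np

sample = np.array([4400.0, 4600.0, 4500.0, 4700.0, 4800.0])  # hypothetical prices
popmean = 4500

n = len(sample)
xbar = sample.mean()
s = sample.std(ddof=1)                  # sample standard deviation

# t = (sample mean - hypothesised mean) / standard error
t_stat = (xbar - popmean) / (s / np.sqrt(n))
degrees_of_freedom = n - 1
print(round(t_stat, 3), degrees_of_freedom)  # 1.414 4
```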

## Two-sample t-test

In [None]:
# Define sample groups
prices_good = df[df["cut"] == "Good"]["price"]
prices_very_good = df[df["cut"] == "Very Good"]["price"]

# Perform two-sample t-test
test_results = stats.ttest_ind(prices_good, prices_very_good)

p_value = test_results[1]

# Check p-value
if p_value < alpha:
    print("Reject the null hypothesis with p-value", p_value)
else:
    print("Fail to reject the null hypothesis with p-value", p_value)


## Simulations Using Random Numbers
Model potential outcomes using random sampling.

Use `np.random.normal()` to generate a sample from a normal distribution.

In [None]:
sample = np.random.normal(loc=3932, scale=750, size=1000)
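
A minimal sketch of a repeatable simulation. It uses a seeded `np.random.default_rng()` generator, the same approach as the sample dataset at the top of this cheatsheet; the seed and the 4500 threshold are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable

# 1,000 simulated prices from a normal distribution (mean 3932, sd 750)
sample = rng.normal(loc=3932, scale=750, size=1000)

# Share of simulated outcomes above a hypothetical threshold of 4500
share_above = (sample > 4500).mean()
print(round(sample.mean(), 1), round(share_above, 3))
```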