<a href="https://colab.research.google.com/github/valeria-edulabs/ai-experts/blob/main/meeting15/museum-walkins.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

# Plotly Express (high-level, simple to use)
import plotly.express as px
# Plotly Graph Objects (low-level, highly customizable)
import plotly.graph_objects as go


In [None]:
data_path = "https://storage.googleapis.com/edulabs-public-datasets/museum-walkins.csv"

In [None]:
# Set Plotly as Pandas plotting backend

pd.options.plotting.backend = "plotly"

# Loading data

In [None]:
df = pd.read_csv(data_path, parse_dates=["date"], dayfirst=True, header=0, usecols=[1, 3, 4, 5, 6], names=["date", "type", "amnt", "weather", "exhibition"])

In [None]:
df

In [None]:
df['weekday'] = df['date'].dt.day_name()

## General trend

In [None]:
visits_per_date = df[['date', 'amnt']].groupby("date").sum()

In [None]:
visits_per_date.plot()

In [None]:
df_monthly = df[['date', 'amnt']].groupby(pd.Grouper(key="date", freq="ME")).sum()
df_monthly.plot()

### Resampling into two-month bins

In [None]:
# Mean of `amnt` in two-month bins (resampling yields one value per bin, not per row)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html#pandas.DataFrame.resample
df.set_index('date').resample('2ME')['amnt'].mean().plot()

### Moving average (smoothing)

Moving averages smooth a series so that its underlying trend is easier to see. The main reasons to use one are:

**1. Identifying Trends:**

*   **Smoothing out noise:** Raw data often contains a lot of random fluctuations or "noise" that can obscure the underlying trend. Moving averages smooth out these fluctuations, making it easier to see the bigger picture.
*   **Highlighting direction:** By averaging data over a specific period, moving averages help to clarify the direction in which the data is moving (upward, downward, or sideways).

**2. Reducing the Impact of Outliers:**

*   **Minimizing distortion:** Outliers, or extreme values, can distort how the data is perceived. A moving average dampens an outlier by averaging it with its neighbors, although the outlier still pulls every window that contains it.

**3. Forecasting and Prediction:**

*   **Extrapolating trends:** In some cases, moving averages can be used to extrapolate trends and make predictions about future values. By observing the direction and slope of a moving average, analysts might infer potential future movements in the data.
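
As a minimal sketch of the smoothing and outlier points above, the example below applies a 3-observation rolling mean and rolling median to a small made-up series. The series `toy`, its values, and the window size are assumptions chosen for illustration; they are not taken from the museum dataset.

```python
import pandas as pd

# Hypothetical toy series: a steady upward trend with one outlier (90).
toy = pd.Series(
    [10, 12, 11, 13, 90, 14, 15, 16, 17, 18],
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
    name="visits",
)

# Rolling mean over 3 observations. The first 2 entries are NaN because
# the window is not yet full (min_periods defaults to the window size).
rolling_mean = toy.rolling(window=3).mean()

# Rolling median over the same window is less affected by the outlier.
rolling_median = toy.rolling(window=3).median()

comparison = pd.DataFrame(
    {"raw": toy, "mean_3": rolling_mean, "median_3": rolling_median}
)
print(comparison)
```

In the printed table, the rolling mean is pulled up in every window that contains the outlier, while the rolling median stays close to the surrounding values. A rolling median may therefore be worth trying when a few extreme days dominate the plot.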



In [None]:
# Try different window sizes to see how the trend changes.
# Note: an integer window (here 20) counts rows, not calendar days.
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html#pandas.DataFrame.rolling
df.set_index('date')['amnt'].rolling(window=20).mean().plot()

## Visitors per year

In [None]:
df['year'] = df['date'].dt.year

In [None]:
# Mean of `amnt` per row, grouped by year
df[['year', 'amnt']].groupby('year').mean()

## How do season and weather affect visitor traffic?

In [None]:
def get_season(month):
    """Map a month number (1-12) to a meteorological season name."""
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    else:
        return "Autumn"


df["season"] = df["date"].dt.month.map(get_season)

In [None]:
visitors_per_season_weather = df[['amnt', 'season', 'weather']].groupby(['season', 'weather']).sum()

In [None]:
visitors_per_season_weather

In [None]:
# Pivot the table: make seasons the columns and weather the rows
df_pivot = visitors_per_season_weather.pivot_table(index="weather", columns="season", values="amnt", aggfunc="sum")

# Display the transformed table
print(df_pivot)

In [None]:
fig = px.imshow(df_pivot.values,
                labels=dict(x="Season", y="Weather", color="Count"),
                x=df_pivot.columns,
                y=df_pivot.index,
                text_auto=True,  # Display values inside the heatmap
                color_continuous_scale="Blues")  # Change to Reds, Viridis, etc.

fig.update_layout(title="Total visits by weather / season")
fig.show()

How many days do we have for each season / weather combination?

In [None]:
crosstab = pd.crosstab(df["season"], df["weather"]).transpose()
print(crosstab)

In [None]:
fig = px.imshow(crosstab.values,
                labels=dict(x="Season", y="Weather", color="Count"),
                x=crosstab.columns,
                y=crosstab.index,
                text_auto=True,  # Display values inside the heatmap
                color_continuous_scale="Blues")  # Change to Reds, Viridis, etc.

fig.update_layout(title="Weather Distribution by Season")
fig.show()


Normalize by Available Days per Weather Type

In [None]:
df[['weather', 'amnt']].groupby(['weather']).mean()
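
To illustrate why the cell above uses a mean instead of a total, here is a small sketch on made-up data. The frame `toy` and its values are assumptions for illustration only; the column names `weather` and `amnt` mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: three sunny records and one rainy record.
toy = pd.DataFrame({
    "weather": ["Sunny", "Sunny", "Sunny", "Rainy"],
    "amnt": [10, 12, 8, 20],
})

# Totals favor the weather type that simply occurs more often...
totals = toy.groupby("weather")["amnt"].sum()

# ...while the per-record mean divides out the number of records.
means = toy.groupby("weather")["amnt"].mean()

print(pd.DataFrame({"total": totals, "mean": means}))
```

Here "Sunny" has the larger total only because it has more records; per record, "Rainy" is higher. The same effect can make the total-visits heatmap misleading when some season / weather combinations are rare.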

Normalize by season and weather: mean, median, and standard deviation per combination

In [None]:
df[['season', 'weather', 'amnt']].groupby(['season', 'weather']).agg(['mean', 'median', 'std'])

In [None]:
normalized_amnts = df[['season', 'weather', 'amnt']].groupby(['season', 'weather']).mean()

In [None]:
df_pivot_normalized = normalized_amnts.pivot_table(index="weather", columns="season", values="amnt", aggfunc="sum")

# Display the transformed table
print(df_pivot_normalized)

In [None]:
fig = px.imshow(df_pivot_normalized.values,
                labels=dict(x="Season", y="Weather", color="Mean visits"),
                x=df_pivot_normalized.columns,
                y=df_pivot_normalized.index,
                text_auto=True,  # Display values inside the heatmap
                color_continuous_scale="Blues")  # Change to Reds, Viridis, etc.

fig.update_layout(title="Normalized visits by weather / season")
fig.show()

## Distributions

In [None]:
df['season_weather'] = df['season'] + '/' + df['weather']
fig = px.box(df, x="season_weather", y="amnt")
fig.update_layout(yaxis=dict(range=[0, 100]))
fig.show()

In [None]:
fig = px.histogram(df, x="amnt", nbins=550, marginal="box")
fig.update_layout(xaxis=dict(range=[0, 80]))  # Set x-axis range from 0 to 80
fig.show()

In [None]:
df['amnt'].agg(["min", "max", "mean", "median", "std"])

In [None]:
df[df['weekday'] == 'Saturday']['amnt'].agg(["min", "max", "mean", "median", "std"])

### Per type

In [None]:
fig = px.box(df, x="type", y="amnt", title="Boxplot per Type", color='type', height=800)
fig.update_layout(yaxis=dict(range=[0, 200]))  # Set y-axis range from 0 to 200
fig.show()

### Per weekday

In [None]:
fig = px.box(df, x="weekday", y="amnt", title="Boxplot per Weekday", color='season', height=600)
fig.update_layout(yaxis=dict(range=[0, 120]))  # Set y-axis range from 0 to 120
fig.show()