## Import Statements

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

## Data Exploration

Use the provided .csv file that contains data on the popularity of various programming languages over time. The data shows the number of [Stack Overflow](https://stackoverflow.com/) posts per month and programming language.

**Challenge**: Read the .csv file and store it in a Pandas dataframe

In [4]:
df = pd.read_csv('QueryResults.csv', names=['DATE', 'TAG', 'POSTS'], header=0)
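To see what `names=` combined with `header=0` does, here is a minimal sketch that reads an in-memory CSV instead of `QueryResults.csv`. The header names in the sample text are hypothetical stand-ins (the real file's header is not shown here); the three data rows mirror the first rows printed by `df.head()`.

```python
import io
import pandas as pd

# Stand-in for QueryResults.csv; the header names below are hypothetical.
csv_text = """m,TagName,Count
2008-07-01 00:00:00,c#,3
2008-08-01 00:00:00,assembly,8
2008-08-01 00:00:00,c,82
"""

# header=0 marks the first line as a header row to be replaced,
# and names= supplies the new column labels.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    names=['DATE', 'TAG', 'POSTS'],
    header=0,
)

print(sample_df.columns.tolist())
print(len(sample_df))
```

Without `header=0`, pandas would treat the original header line as a data row, so the frame would contain one extra row of text.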

**Challenge**: Examine the first 5 rows and the last 5 rows of the dataframe

In [5]:
df.head()

Unnamed: 0,DATE,TAG,POSTS
0,2008-07-01 00:00:00,c#,3
1,2008-08-01 00:00:00,assembly,8
2,2008-08-01 00:00:00,c,82
3,2008-08-01 00:00:00,c#,503
4,2008-08-01 00:00:00,c++,164


In [6]:
df.tail()

Unnamed: 0,DATE,TAG,POSTS
2686,2024-09-01 00:00:00,php,659
2687,2024-09-01 00:00:00,python,3974
2688,2024-09-01 00:00:00,r,780
2689,2024-09-01 00:00:00,ruby,85
2690,2024-09-01 00:00:00,swift,552


**Challenge:** Check how many rows and how many columns there are. 
What are the dimensions of the dataframe?

In [None]:
df.shape

**Challenge**: Count the number of entries in each column of the dataframe

In [7]:
df.count()

DATE     2691
TAG      2691
POSTS    2691
dtype: int64

**Challenge**: Calculate the total number of posts per language.
Which programming language has had the highest total number of posts of all time?

In [8]:
df.groupby('TAG').agg({'POSTS': 'sum'})

TAG,POSTS
assembly,44746
c,406177
c#,1621433
c++,810965
delphi,52180
go,73799
java,1918393
javascript,2532022
perl,68225
php,1467305
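Rather than scanning the table by eye, the totals can be sorted or the top label picked directly. This is a small sketch on toy data with made-up post counts, not the real dataset; the names `toy_df`, `totals`, `ranked`, and `top_language` are introduced here for illustration only.

```python
import pandas as pd

# Toy data with made-up post counts.
toy_df = pd.DataFrame({
    'TAG': ['java', 'python', 'java', 'python', 'c'],
    'POSTS': [100, 80, 120, 200, 50],
})

# Sum the posts per tag, then rank the tags from most to fewest posts.
totals = toy_df.groupby('TAG')['POSTS'].sum()
ranked = totals.sort_values(ascending=False)

# idxmax() returns the index label (the tag) with the largest total.
top_language = totals.idxmax()

print(ranked)
print(top_language)
```

The same two calls, `sort_values(ascending=False)` and `idxmax()`, should work on the result of the `groupby` on the full dataframe.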


Some languages are older (e.g., C) and other languages are newer (e.g., Swift). The dataset starts in July 2008.

**Challenge**: How many months of data exist per language? Which language had the fewest months with an entry? 


In [None]:
df.groupby('TAG')['DATE'].count()
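To answer the second half of the challenge, the per-language counts can be sorted or queried for the smallest value. The sketch below uses toy data with made-up values; `months_per_tag` and `fewest` are illustrative names.

```python
import pandas as pd

# Toy data: 'c' has three monthly entries, 'python' two, and 'go' one.
toy_df = pd.DataFrame({
    'DATE': ['2008-08-01', '2008-09-01', '2008-10-01',
             '2008-09-01', '2008-10-01', '2008-10-01'],
    'TAG': ['c', 'c', 'c', 'python', 'python', 'go'],
    'POSTS': [82, 90, 95, 40, 55, 7],
})

# Count the rows (months) per tag.
months_per_tag = toy_df.groupby('TAG')['DATE'].count()

# idxmin() returns the tag with the fewest monthly entries.
fewest = months_per_tag.idxmin()

print(months_per_tag.sort_values())
print(fewest)
```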

## Data Cleaning

Let's fix the date format to make it more readable. We need to use Pandas to convert each entry from a string such as "2008-07-01 00:00:00" to a datetime object, which is displayed as "2008-07-01".

In [11]:
df['DATE'][1]

Timestamp('2008-08-01 00:00:00')

In [12]:
type(df['DATE'][1])

pandas._libs.tslibs.timestamps.Timestamp

In [9]:
print(pd.to_datetime(df['DATE'][1]))
type(pd.to_datetime(df['DATE'][1]))

2008-08-01 00:00:00


pandas._libs.tslibs.timestamps.Timestamp

In [10]:
# Convert Entire Column
df['DATE'] = pd.to_datetime(df['DATE'])
df.head()

Unnamed: 0,DATE,TAG,POSTS
0,2008-07-01,c#,3
1,2008-08-01,assembly,8
2,2008-08-01,c,82
3,2008-08-01,c#,503
4,2008-08-01,c++,164
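The effect of the conversion is easiest to see on a column that definitely starts out as strings. This is a minimal sketch on a two-row toy frame; the `.dt` accessor shown at the end is only available once the column holds datetimes.

```python
import pandas as pd

# A toy column of date strings in the same format as the dataset.
toy_df = pd.DataFrame({'DATE': ['2008-07-01 00:00:00', '2008-08-01 00:00:00']})
dtype_before = toy_df['DATE'].dtype

# Convert the whole column from strings to datetimes.
toy_df['DATE'] = pd.to_datetime(toy_df['DATE'])
dtype_after = toy_df['DATE'].dtype

print(dtype_before, dtype_after)

# Datetime columns expose date parts through the .dt accessor.
print(toy_df['DATE'].dt.year.tolist())
print(toy_df['DATE'].dt.month.tolist())
```

As an alternative, `pd.read_csv` accepts a `parse_dates` argument (for example `parse_dates=['DATE']`) that performs the conversion while loading.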


## Data Manipulation



**Challenge:** Reshape the dataframe such that each row is a date and each column is a tag. The values are the post counts.

In [None]:
reshaped_df = df.pivot(index='DATE', columns='TAG', values='POSTS')
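A small sketch of what `pivot` does, using a toy long-format frame with illustrative values. Note that `python` has no row for August 2008, so the pivoted frame holds `NaN` in that cell.

```python
import pandas as pd

# Long format: one row per (date, tag) pair; values are illustrative.
toy_df = pd.DataFrame({
    'DATE': pd.to_datetime(['2008-08-01', '2008-08-01',
                            '2008-09-01', '2008-09-01', '2008-09-01']),
    'TAG': ['c', 'java', 'c', 'java', 'python'],
    'POSTS': [82, 220, 320, 1129, 537],
})

# Wide format: one row per date, one column per tag.
toy_wide = toy_df.pivot(index='DATE', columns='TAG', values='POSTS')

print(toy_wide)
print(toy_wide.shape)
```

`pivot` requires each (index, column) pair to be unique; if a date and tag appeared twice, it would raise an error rather than aggregate.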

**Challenge**: What are the dimensions of our new dataframe? How many rows and columns does it have? Print out the column names and print out the first 5 rows of the dataframe.

In [None]:
reshaped_df.shape

In [None]:
reshaped_df.columns

In [None]:
reshaped_df.head()

**Challenge**: Count the number of entries per programming language. Why might the number of entries be different? 

In [None]:
reshaped_df.count()

**Challenge**: Fill in the missing values with 0. Once again, check the number of entries per programming language and print the first 5 rows of the dataframe.

In [None]:
reshaped_df.fillna(0, inplace=True) 

In [None]:
reshaped_df.head()

In [None]:
reshaped_df.isna().values.any()

In [None]:
reshaped_df.count()
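The link between missing values and `count()` can be checked on a toy wide frame. This sketch uses illustrative values and returns a new frame from `fillna` instead of passing `inplace=True`, which is one possible style, not the only one.

```python
import numpy as np
import pandas as pd

# Toy wide frame: 'python' has no entry for the first month.
toy_wide = pd.DataFrame(
    {'c': [82.0, 320.0], 'python': [np.nan, 537.0]},
    index=pd.to_datetime(['2008-08-01', '2008-09-01']),
)

# count() skips NaN, so the columns report different numbers of entries.
count_before = toy_wide.count()

# Replace every NaN with 0 and count again.
filled = toy_wide.fillna(0)
count_after = filled.count()

print(count_before)
print(count_after)
print(filled.isna().values.any())
```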

## Data Visualization with Matplotlib

We will dive deeper into data visualization in the next week, but here, you can already get a first glimpse at the power of data visualization with Python.

Note: You might need to install the matplotlib library first. See the [matplotlib page on PyPI](https://pypi.org/project/matplotlib/) for how to do it.

**Challenge**: Use the [matplotlib documentation](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) to plot a single programming language (e.g., java) on a chart.

In [None]:
plt.plot(reshaped_df.index, reshaped_df['java'])

In [None]:
plt.figure(figsize=(10,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Number of Posts', fontsize=14)
plt.ylim(0, 35000)
plt.plot(reshaped_df.index, reshaped_df.java)

**Challenge**: Show two lines (e.g. for Java and Python) on the same chart.

In [None]:
plt.figure(figsize=(10,6)) # make chart larger
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Number of Posts', fontsize=14)
plt.ylim(0, 35000)

plt.plot(reshaped_df.index, reshaped_df.java)
plt.plot(reshaped_df.index, reshaped_df.python)

**Challenge**: Plot all lines (i.e. for all programming languages) on the same chart.

In [None]:
plt.figure(figsize=(10,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Number of Posts', fontsize=14)
plt.ylim(0, 35000)

for column in reshaped_df.columns:
    plt.plot(reshaped_df.index, reshaped_df[column], 
             linewidth=3, label=column)

plt.legend(fontsize=16)
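The same chart can also be built with matplotlib's object-oriented interface, which keeps the figure and axes in variables instead of relying on the current global figure. This is a sketch on a toy frame with made-up values; the `matplotlib.use('Agg')` line is an assumption for running it as a script without a display and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy wide frame with made-up post counts.
toy_wide = pd.DataFrame(
    {'java': [220, 1129, 1149], 'python': [120, 537, 508]},
    index=pd.to_datetime(['2008-08-01', '2008-09-01', '2008-10-01']),
)

fig, ax = plt.subplots(figsize=(10, 6))

# One line per column, labelled with the column name for the legend.
for column in toy_wide.columns:
    ax.plot(toy_wide.index, toy_wide[column], linewidth=3, label=column)

ax.set_xlabel('Date', fontsize=14)
ax.set_ylabel('Number of Posts', fontsize=14)
ax.tick_params(labelsize=14)
ax.legend(fontsize=14)
```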

## Smoothing out Time Series Data

Time series data can be quite noisy, with a lot of up and down spikes. To better see a trend, we can plot an average of, say, 6 or 12 observations. This is called the rolling mean. We calculate the average in a window of time and move it forward by one observation. Pandas has two handy methods already built in to work this out: [rolling()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) and [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.mean.html). 
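To make the window arithmetic concrete, here is a minimal sketch on a six-value toy series with a window of 3. The first two results are `NaN` because a full window is not yet available; `min_periods=1` is shown as one optional way to average whatever observations exist so far.

```python
import pandas as pd

# Toy series with made-up values.
s = pd.Series([10, 20, 30, 40, 50, 60])

# Each result is the mean of the current value and the two before it.
rolled = s.rolling(window=3).mean()
print(rolled.tolist())  # [nan, nan, 20.0, 30.0, 40.0, 50.0]

# With min_periods=1, incomplete windows are averaged instead of left as NaN.
rolled_partial = s.rolling(window=3, min_periods=1).mean()
print(rolled_partial.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0, 50.0]
```

A larger window gives a smoother line but hides short-lived changes and leaves more leading `NaN` values.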

In [None]:
# The window is the number of observations that are averaged
roll_df = reshaped_df.rolling(window=6).mean()

plt.figure(figsize=(10,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Number of Posts', fontsize=14)
plt.ylim(0, 35000)

# plot the roll_df instead
for column in roll_df.columns:
    plt.plot(roll_df.index, roll_df[column], 
             linewidth=3, label=column)

plt.legend(fontsize=14)