## Import Statements

In [2]:
import pandas as pd

## Data Exploration

Use the provided .csv file that contains data on the popularity of various programming languages over time. The data shows the number of [Stack Overflow](https://stackoverflow.com/) posts per month and programming language.

**Challenge**: Read the .csv file and store it in a Pandas dataframe

In [7]:
df = pd.read_csv('QueryResults.csv', names=["DATE", "NAME", "POSTS"])

**Challenge**: Examine the first 5 rows and the last 5 rows of the dataframe

In [13]:
df.head()

|   | DATE | NAME | POSTS |
|---|------|------|-------|
| 0 | m | TagName | NaN |
| 1 | 2008-07-01 00:00:00 | c# | 3.0 |
| 2 | 2008-08-01 00:00:00 | assembly | 8.0 |
| 3 | 2008-08-01 00:00:00 | c | 82.0 |
| 4 | 2008-08-01 00:00:00 | c# | 503.0 |
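
Row 0 of the output above holds `m` and `TagName`, which look like the file's own header line read in as data. That is what `read_csv` does when `names=` is passed without `header=0`. A minimal sketch of the difference, using a small in-memory CSV as a stand-in for `QueryResults.csv` (the sample rows are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for QueryResults.csv: a header line followed by data rows.
csv_text = (
    "m,TagName,\n"
    "2008-07-01 00:00:00,c#,3\n"
    "2008-08-01 00:00:00,assembly,8\n"
)

cols = ["DATE", "NAME", "POSTS"]

# names= alone: pandas assumes there is no header, so line 0 becomes a data row.
with_header_row = pd.read_csv(io.StringIO(csv_text), names=cols)

# header=0 marks line 0 as a header, which names= then replaces.
clean = pd.read_csv(io.StringIO(csv_text), names=cols, header=0)

print(with_header_row.shape)
print(clean.shape)
```

With the stray row gone, `POSTS` no longer needs a missing value to accommodate the header line, and later steps such as date parsing do not have to work around the string `m`.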


In [14]:
df.tail()

|      | DATE | NAME | POSTS |
|------|------|------|-------|
| 2687 | 2024-09-01 00:00:00 | php | 659.0 |
| 2688 | 2024-09-01 00:00:00 | python | 3974.0 |
| 2689 | 2024-09-01 00:00:00 | r | 780.0 |
| 2690 | 2024-09-01 00:00:00 | ruby | 85.0 |
| 2691 | 2024-09-01 00:00:00 | swift | 552.0 |


**Challenge:** Check how many rows and how many columns there are. 
What are the dimensions of the dataframe?

In [17]:
df.shape

(2692, 3)

**Challenge**: Count the number of entries in each column of the dataframe

In [18]:
df.count()

DATE     2692
NAME     2692
POSTS    2691
dtype: int64
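
`count()` tallies only non-missing values, which is why `POSTS` reports one entry fewer than the other two columns: row 0 of the dataframe has no post count. A small sketch on a toy frame (values invented for illustration) shows one way to locate such rows:

```python
import pandas as pd

# Toy frame mimicking the situation: one row has no POSTS value.
toy = pd.DataFrame({
    "DATE": ["m", "2008-07-01 00:00:00", "2008-08-01 00:00:00"],
    "NAME": ["TagName", "c#", "assembly"],
    "POSTS": [None, 3.0, 8.0],
})

# Non-missing entries per column.
counts = toy.count()

# Rows where POSTS is missing.
missing_rows = toy[toy["POSTS"].isna()]

print(counts)
print(missing_rows)
```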

**Challenge**: Calculate the total number of posts per language.
Which programming language has had the highest total number of posts of all time?

In [24]:
df.groupby('NAME')['POSTS'].sum().astype(int).reset_index()

|   | NAME | POSTS |
|---|------|-------|
| 0 | TagName | 0 |
| 1 | assembly | 44746 |
| 2 | c | 406177 |
| 3 | c# | 1621433 |
| 4 | c++ | 810965 |
| 5 | delphi | 52180 |
| 6 | go | 73799 |
| 7 | java | 1918393 |
| 8 | javascript | 2532022 |
| 9 | perl | 68225 |
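
The totals above are listed alphabetically, so the largest value has to be picked out by eye. As a sketch, sorting the grouped sums or calling `idxmax()` answers the question directly. The toy numbers below are invented and are not taken from the dataset:

```python
import pandas as pd

# Invented post counts for three languages across several months.
toy = pd.DataFrame({
    "NAME": ["java", "python", "java", "python", "go"],
    "POSTS": [100, 150, 120, 90, 30],
})

# Total posts per language.
totals = toy.groupby("NAME")["POSTS"].sum()

# Highest totals first.
ranked = totals.sort_values(ascending=False)

# Label of the language with the largest total.
top_language = totals.idxmax()

print(ranked)
print(top_language)
```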


Some languages are older (e.g., C) and other languages are newer (e.g., Swift). The dataset starts in July 2008.

**Challenge**: How many months of data exist per language? Which language had the fewest months with an entry? 


In [32]:
df.groupby('NAME')['DATE'].nunique().reset_index()

|   | NAME | DATE |
|---|------|------|
| 0 | TagName | 1 |
| 1 | assembly | 194 |
| 2 | c | 194 |
| 3 | c# | 195 |
| 4 | c++ | 194 |
| 5 | delphi | 194 |
| 6 | go | 179 |
| 7 | java | 194 |
| 8 | javascript | 194 |
| 9 | perl | 194 |
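
To find the language with the fewest months without scanning the table, the per-language counts can be sorted or passed to `idxmin()`. A minimal sketch on invented data:

```python
import pandas as pd

# Invented entries: "c" appears in three months, "go" in two, "swift" in one.
toy = pd.DataFrame({
    "DATE": ["2008-08-01", "2008-09-01", "2008-10-01",
             "2008-09-01", "2008-10-01", "2008-10-01"],
    "NAME": ["c", "c", "c", "go", "go", "swift"],
    "POSTS": [5, 6, 7, 1, 2, 1],
})

# Number of distinct months with an entry, per language.
months = toy.groupby("NAME")["DATE"].nunique()

# Language with the fewest months of data.
fewest = months.idxmin()

print(months.sort_values())
print(fewest)
```

On the real dataframe, the stray `TagName` row would come out lowest with a single entry, so it should be excluded before reading off the answer.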


## Data Cleaning

Let's fix the date format to make it more readable. We need to use Pandas to change the format from a string such as "2008-07-01 00:00:00" to a datetime object displayed as "2008-07-01".

In [33]:
df['DATE'][1]

'2008-07-01 00:00:00'
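
One possible way to do the conversion is `pd.to_datetime`. The sketch below uses a toy frame that includes a non-date string like the `m` in row 0 of the dataframe. Passing `errors="coerce"` and then dropping the resulting `NaT` row is one way to handle it, not necessarily the approach the exercise intends:

```python
import pandas as pd

# Toy frame: the first DATE value is not a date, as in row 0 of the dataframe.
toy = pd.DataFrame({
    "DATE": ["m", "2008-07-01 00:00:00", "2008-08-01 00:00:00"],
    "POSTS": [None, 3.0, 8.0],
})

# errors="coerce" turns strings that do not match the format into NaT
# instead of raising an error.
toy["DATE"] = pd.to_datetime(
    toy["DATE"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Drop the row whose date could not be parsed.
toy = toy.dropna(subset=["DATE"]).reset_index(drop=True)

print(toy)
print(toy["DATE"].dtype)
```

Because every timestamp falls at midnight, pandas displays the column as plain dates such as `2008-07-01`.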

In [34]:
type(df['DATE'][1])

str

## Data Manipulation

**Challenge:** Reshape the dataframe such that each row is a date and each column is a tag. The values are the post counts.
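
For the reshape challenge in this section, one possible approach is `DataFrame.pivot`, which turns the values of one column into column headers. A minimal sketch on invented data (the name `reshaped` is an assumption, not a name from the notebook):

```python
import pandas as pd

# Invented long-format data: one row per (date, language) pair.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2008-08-01", "2008-08-01", "2008-09-01", "2008-09-01", "2008-09-01"]
    ),
    "NAME": ["c", "java", "c", "java", "go"],
    "POSTS": [10, 20, 30, 40, 5],
})

# One row per date, one column per language, post counts as values.
reshaped = toy.pivot(index="DATE", columns="NAME", values="POSTS")

print(reshaped.shape)
print(reshaped.columns.tolist())
print(reshaped.head())
```

Here `go` has no entry for August, so that cell is `NaN` in the reshaped frame. `pivot` raises an error if any (date, language) pair appears more than once.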

**Challenge**: What are the dimensions of our new dataframe? How many rows and columns does it have? Print out the column names and print out the first 5 rows of the dataframe.

**Challenge**: Count the number of entries per programming language. Why might the number of entries be different? 

**Challenge**: Fill in the missing values with 0. Once again, check the number of entries per programming language and print the first 5 rows of the dataframe.
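
Languages that appeared later have no posts in the early months, so their columns contain `NaN` and `count()` reports fewer entries for them. A sketch of filling those gaps with `fillna(0)`, using an invented wide-format frame:

```python
import pandas as pd

# Invented wide-format frame: "go" has no entry in the first month.
reshaped = pd.DataFrame(
    {"c": [10.0, 30.0], "go": [None, 5.0]},
    index=pd.to_datetime(["2008-08-01", "2008-09-01"]),
)

# Entries per language before filling: NaN cells are not counted.
before = reshaped.count()

# Replace every missing value with 0.
filled = reshaped.fillna(0)

# Entries per language after filling.
after = filled.count()

print(before)
print(after)
print(filled.head())
```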

## Data Visualization with Matplotlib

We will dive deeper into data visualization in the next week, but here, you can already get a first glimpse at the power of data visualization with Python.

Note: You might need to install the matplotlib library first. See the [PyPI page](https://pypi.org/project/matplotlib/) for installation instructions.

**Challenge**: Use the [matplotlib documentation](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) to plot a single programming language (e.g., java) on a chart.

**Challenge**: Show two lines (e.g. for Java and Python) on the same chart.
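
As a rough sketch of these two challenges, each call to `plot` adds one line to the same axes. The frame below is invented and stands in for the reshaped dataframe, and the name `reshaped` is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented wide-format data: dates as the index, one column per language.
reshaped = pd.DataFrame(
    {"java": [10, 20, 15, 30], "python": [5, 25, 35, 50]},
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
)

fig, ax = plt.subplots(figsize=(8, 5))

# A single language on the chart.
ax.plot(reshaped.index, reshaped["java"], label="java")

# A second call draws another line on the same axes.
ax.plot(reshaped.index, reshaped["python"], label="python")

ax.set_xlabel("Date")
ax.set_ylabel("Number of posts")
ax.legend()
```

In a notebook the figure is rendered at the end of the cell. In a plain script, `plt.show()` would be needed to display it.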

**Challenge**: Plot all lines (i.e. for all programming languages) on the same chart.
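
To plot every language without writing one `plot` call per column, a loop over the column names is one option. A sketch with an invented frame (again assuming the wide-format dataframe is called `reshaped`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented wide-format data with three languages.
reshaped = pd.DataFrame(
    {"c": [8, 9, 7, 6], "java": [10, 20, 15, 30], "python": [5, 25, 35, 50]},
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
)

fig, ax = plt.subplots(figsize=(10, 6))

# One line per column, labelled with the language name.
for column in reshaped.columns:
    ax.plot(reshaped.index, reshaped[column], label=column)

ax.set_xlabel("Date")
ax.set_ylabel("Number of posts")
ax.legend()
```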

## Smoothing out Time Series Data

Time series data can be quite noisy, with a lot of up and down spikes. To see a trend more clearly, we can plot an average of, say, 6 or 12 observations. This is called the rolling mean: we calculate the average over a window of observations and move the window forward by one observation at a time. Pandas has two handy methods built in to work this out: [rolling()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) and [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.mean.html).

**Challenge:** Plot the rolling average for all programming languages on a chart.
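
A minimal sketch of the rolling mean on invented data, using a window of 3 so the arithmetic is easy to follow (for the real data a window of 6 or 12 would be more typical). The names `reshaped` and `roll_df` are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented wide-format data: six monthly observations for two languages.
reshaped = pd.DataFrame(
    {"java": [10, 20, 30, 40, 50, 60], "python": [60, 40, 50, 30, 40, 20]},
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# Average over a sliding window of 3 observations. The first two rows
# are NaN because a full window is not yet available.
window = 3
roll_df = reshaped.rolling(window=window).mean()

fig, ax = plt.subplots(figsize=(10, 6))

# One smoothed line per language.
for column in roll_df.columns:
    ax.plot(roll_df.index, roll_df[column], label=column)

ax.set_xlabel("Date")
ax.set_ylabel("Number of posts (rolling mean)")
ax.legend()
```

For `java`, the third value of the rolling mean is (10 + 20 + 30) / 3 = 20, the fourth is (20 + 30 + 40) / 3 = 30, and so on. A larger window gives a smoother line but leaves more leading `NaN` values.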