# Popular Data Science Questions

[Stack Exchange](https://stackexchange.com/) is a popular network of Q&A sites covering software, IT, math, and pretty much any other technical topic you can think of.

It is best known for its software question-and-answer site, [Stack Overflow](https://stackoverflow.com/).

For this project we will take a look at some of the more popular questions on a lesser-known Stack Exchange site: [Data Science Stack Exchange](https://datascience.stackexchange.com/).

### Questions for Data Science Stack Exchange

 - What kinds of questions are welcome?
    - Questions about machine learning, algorithms, and other topics relevant to data science.
 
 - How does the home page subdivide its questions?
    - Active, Bountied, Hot, Week, and Month
 
 - They also divide the navigation bar into several different sections:

      1. Home 
      2. Questions 
      3. Tags 
      4. Users
      5. Unanswered

Luckily, Stack Exchange provides a public [database](https://data.stackexchange.com/datascience/query/new) (the Stack Exchange Data Explorer) that we can query to find useful information.

#### Promising Tables:

 1. Posts
 2. Users 
 3. Votes
 4. Tags
 5. PostTypes


I wrote a query that collects the following columns from the `Posts` table on Stack Exchange's site:

*SELECT Id, PostTypeId, CreationDate, Score, ViewCount,
       Tags, AnswerCount, FavoriteCount*

 *FROM posts*

 *WHERE PostTypeID = 1 OR PostTypeID = 2 
   AND CreationDate BETWEEN '2019-01-01' AND '2021-01-01';*

---


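The query selects the listed columns. The intent was to keep only questions (`PostTypeId = 1`) and answers (`PostTypeId = 2`) created in 2019 and 2020.

**Note:** SQL evaluates `AND` before `OR`, so as written the date range applies only to answers. Questions from every year are returned, which is why rows from 2014 appear in the output further down.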
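How the `WHERE` clause is grouped changes which rows come back, because SQL gives `AND` higher precedence than `OR`. The following is a minimal pandas sketch of that effect, assuming a small made-up frame in place of the real `Posts` table; the names `posts`, `as_written`, and `as_intended` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Posts table (all values are made up for illustration).
posts = pd.DataFrame({
    "PostTypeId": [1, 1, 2, 2],
    "CreationDate": pd.to_datetime(
        ["2014-12-06", "2019-06-01", "2014-12-07", "2020-03-15"]
    ),
})

is_question = posts["PostTypeId"] == 1
is_answer = posts["PostTypeId"] == 2
in_range = posts["CreationDate"].between("2019-01-01", "2021-01-01")

# How SQL groups `A OR B AND C`: the date filter only reaches answers.
as_written = posts[is_question | (is_answer & in_range)]

# With explicit grouping, the date filter applies to both post types.
as_intended = posts[(is_question | is_answer) & in_range]

print(len(as_written), len(as_intended))
```

In SQL, the second grouping corresponds to wrapping the two `PostTypeId` conditions in parentheses before the `AND`.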

### Analysis:

Let's start by reading in the dataset that we downloaded and exploring the data frame.

In [1]:
import numpy as np 
import pandas as pd

In [2]:
ds_q_and_a = pd.read_csv('__data__/QueryResults.csv')

In [3]:
# Exploring the data
ds_q_and_a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42421 entries, 0 to 42420
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             42421 non-null  int64  
 1   PostTypeId     42421 non-null  int64  
 2   CreationDate   42421 non-null  object 
 3   Score          42421 non-null  int64  
 4   ViewCount      27649 non-null  float64
 5   Tags           27649 non-null  object 
 6   AnswerCount    27649 non-null  float64
 7   FavoriteCount  7627 non-null   float64
dtypes: float64(3), int64(3), object(2)
memory usage: 2.6+ MB


In [4]:
# Separating answers and questions
ds_questions = ds_q_and_a[ds_q_and_a['PostTypeId'] == 1].copy()
ds_answers = ds_q_and_a[ds_q_and_a['PostTypeId'] == 2].copy()

In [5]:
ds_questions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27649 entries, 0 to 42420
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             27649 non-null  int64  
 1   PostTypeId     27649 non-null  int64  
 2   CreationDate   27649 non-null  object 
 3   Score          27649 non-null  int64  
 4   ViewCount      27649 non-null  float64
 5   Tags           27649 non-null  object 
 6   AnswerCount    27649 non-null  float64
 7   FavoriteCount  7627 non-null   float64
dtypes: float64(3), int64(3), object(2)
memory usage: 1.9+ MB


In [6]:
ds_questions.head()

|   | Id | PostTypeId | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|---|---|---|---|---|---|---|---|---|
| 0 | 2627 | 1 | 2014-12-06 00:41:24 | 4 | 266.0 | `<javascript><visualization>` | 0.0 | NaN |
| 1 | 2628 | 1 | 2014-12-06 01:10:30 | 2 | 511.0 | `<logistic-regression>` | 0.0 | NaN |
| 2 | 2629 | 1 | 2014-12-06 06:53:14 | 3 | 380.0 | `<bigdata><definitions>` | 1.0 | NaN |
| 3 | 2631 | 1 | 2014-12-06 15:04:03 | 7 | 1161.0 | `<machine-learning><data-mining><clustering><an...` | 5.0 | 1.0 |
| 4 | 2632 | 1 | 2014-12-06 17:56:53 | 3 | 35.0 | `<efficiency><map-reduce><performance><experime...` | 1.0 | NaN |


## Initial Analysis:

How many missing values are in each column?

 - In `ds_questions`, only *FavoriteCount* contains missing values (7627 of 27649 entries are non-null).
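
The per-column missing counts can also be read off directly rather than inferred from `.info()`. A minimal sketch, assuming a small made-up frame (`toy`) in place of `ds_questions`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ds_questions (values are made up for illustration).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "ViewCount": [266.0, 511.0, 380.0],
    "FavoriteCount": [np.nan, 1.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fill only the column known to have gaps, then store it as an integer count.
toy["FavoriteCount"] = toy["FavoriteCount"].fillna(0).astype(int)
```

Filling a single column, as here, is a slightly narrower choice than filling the whole frame; either works when only one column has gaps.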

---

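The `ds_questions.info()` output above also shows the *data type* of each column: `CreationDate` is stored as a generic `object` and should be converted to a datetime, and the missing `FavoriteCount` values can be filled with 0.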

In [7]:
# Fill the null values with 0
ds_questions.fillna(0, inplace=True)

In [8]:
# Convert the CreationDate column to datetime
# (assign to the column itself; .loc['CreationDate'] would address a row label)
ds_questions['CreationDate'] = pd.to_datetime(ds_questions['CreationDate'])
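
Once the column holds real datetimes, the `.dt` accessor makes it easy to check which years the data covers. A minimal sketch on a made-up frame; the third row, including its `Id`, is hypothetical and exists only to give the example a second year:

```python
import pandas as pd

# Toy frame; the last row is invented for illustration.
toy = pd.DataFrame({
    "Id": [2627, 2628, 50001],
    "CreationDate": [
        "2014-12-06 00:41:24",
        "2014-12-06 01:10:30",
        "2019-05-01 12:00:00",
    ],
})

# Assign to the column; .loc[...] with a single label would address rows.
toy["CreationDate"] = pd.to_datetime(toy["CreationDate"])

# Count posts per calendar year.
questions_per_year = toy["CreationDate"].dt.year.value_counts().sort_index()
print(questions_per_year)

# Keep only posts created in 2019 or 2020.
recent = toy[toy["CreationDate"].dt.year.isin([2019, 2020])]
```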

In [9]:
# Convert the Tags column from '<a><b>' form into a comma-separated string
separator = '><'
def convert_column(string):
    """Turn '<tag1><tag2>' into 'tag1,tag2'."""
    new_string = str(string).replace(separator, ",")
    # Drop the leftover outer angle brackets
    new_string = new_string.replace(">","")
    new_string = new_string.replace("<","")
    return new_string

convert_column('<HELLO><WORLD>')

'HELLO,WORLD'

In [10]:
# Apply the conversion done above
ds_questions['Tags'] = ds_questions['Tags'].apply(convert_column)

In [12]:
# Make sure everything looks right. 
ds_questions.head()

|   | Id | PostTypeId | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|---|---|---|---|---|---|---|---|---|
| 0 | 2627 | 1 | 2014-12-06 00:41:24 | 4 | 266.0 | javascript,visualization | 0.0 | 0.0 |
| 1 | 2628 | 1 | 2014-12-06 01:10:30 | 2 | 511.0 | logistic-regression | 0.0 | 0.0 |
| 2 | 2629 | 1 | 2014-12-06 06:53:14 | 3 | 380.0 | bigdata,definitions | 1.0 | 0.0 |
| 3 | 2631 | 1 | 2014-12-06 15:04:03 | 7 | 1161.0 | machine-learning,data-mining,clustering,anomal... | 5.0 | 1.0 |
| 4 | 2632 | 1 | 2014-12-06 17:56:53 | 3 | 35.0 | efficiency,map-reduce,performance,experiments | 1.0 | 0.0 |
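
The same tag cleanup can be done with pandas' vectorized string methods instead of a row-wise `apply`. This is a minimal sketch rather than the approach used above; the series `raw_tags` is a made-up sample, and its third entry is invented so that one tag appears twice. It also shows one way the comma-separated form could be split to count individual tags.

```python
import pandas as pd

# Made-up sample of raw tag strings in the '<a><b>' format.
raw_tags = pd.Series([
    "<javascript><visualization>",
    "<logistic-regression>",
    "<bigdata><visualization>",
])

# Strip the outer brackets, then turn each inner '><' into a comma.
clean = raw_tags.str.strip("<>").str.replace("><", ",", regex=False)

# Split into one tag per row and count how often each tag occurs.
tag_counts = clean.str.split(",").explode().value_counts()
print(tag_counts)
```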
