<a href="https://colab.research.google.com/github/codeofarmour/Pandas-Data-Science-Tasks/blob/master/JovianDA_Project_Walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Python Questions from Stack Overflow Data Analysis**

A list of all Python-tagged questions on stackoverflow.com asked between August 2, 2008 and October 19, 2016.

**TODO:** *Write an introduction to your project. Describe the dataset, where you got it from, what you're trying to do with it, and which tools and techniques you're using.*

### **Steps to follow:**


* ~~Select a real-world dataset~~
* Perform data preparation and cleaning using Pandas & NumPy
* Ask & answer questions about the data in Google Colab
* Summarize your inferences and write a conclusion
* Document, publish, and present the Colab notebook online

### Install `opendatasets`

`opendatasets` is a Python library for downloading datasets from online sources like Kaggle and Google Drive using a simple Python command.

[Find it here!](https://pypi.org/project/opendatasets/)

In [1]:
!pip install opendatasets --upgrade

Collecting opendatasets
  Downloading opendatasets-0.1.20-py3-none-any.whl (14 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.20


### Import necessary libraries

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import opendatasets as od 
import os


### **Step 1: Select a real-world dataset**

I selected a dataset from [kaggle.com](https://kaggle.com).

According to [Wikipedia](https://en.wikipedia.org/wiki/Kaggle), Kaggle is best described as an online community that "allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges."

In [11]:
dataset_url = 'https://www.kaggle.com/stackoverflow/pythonquestions?select=Questions.csv'
od.download(dataset_url)

Skipping, found downloaded files in "./pythonquestions" (use force=True to force download)


#### Set the directory for the new dataset

This will allow you to easily reference and access all of the related files stored within your project folder.

In [9]:
data_dir = './pythonquestions' 

In [10]:
os.listdir(data_dir)

['Tags.csv', 'Questions.csv', 'Answers.csv']
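
As a small sketch of how `data_dir` can be used to reference the files listed above, paths can be built with `os.path.join` instead of string concatenation. The helper name `dataset_path` is hypothetical and is not part of the original notebook.

```python
import os

data_dir = './pythonquestions'

def dataset_path(filename, base_dir=data_dir):
    """Return the path to a file inside the dataset folder (hypothetical helper)."""
    return os.path.join(base_dir, filename)

# Build the path to one of the files returned by os.listdir(data_dir)
questions_path = dataset_path('Questions.csv')
print(questions_path)
```

Using `os.path.join` keeps the path separator correct on whichever operating system the notebook runs on.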

### **Step 2: Data Preparation and Cleaning**

* ~~Load the dataset into a dataframe using Pandas~~
* ~~Explore the rows & columns, ranges of values, etc.~~
* Handle missing, incorrect, and invalid data
* Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets, etc.)

#### Task 1: Load the file you wish to work with

In [13]:
df = pd.read_csv(data_dir + '/Questions.csv')

UnicodeDecodeError: ignored

#### Encountered the following error: 
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 1821: invalid continuation byte`

I found a possible solution on, ironically enough, [stackoverflow.com](https://stackoverflow.com/questions/46180610/python-3-unicodedecodeerror-how-do-i-debug-unicodedecodeerror).
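
To make the error message easier to read, here is a minimal standard-library sketch that reproduces the same kind of failure on a tiny made-up byte string (not the real file). It assumes, for illustration only, that the offending text was written in Latin-1, where the byte `0xed` is the letter `í`.

```python
# Hypothetical sample: 'García Lorca' encoded as Latin-1 contains the byte 0xed
raw = 'Garc\u00eda Lorca'.encode('latin-1')

error = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # The exception records why decoding failed and at which byte offset
    error = exc
    print(exc.reason, 'at position', exc.start, 'byte', hex(raw[exc.start]))

# Decoding the same bytes with the encoding they were written in succeeds
decoded = raw.decode('latin-1')
print(decoded)
```

The `start` attribute corresponds to the "position" in the message above, which can help locate the problem byte in a real file.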

In [32]:
# import chardet

# # Attempting to reveal which encoding is used throughout the file
# rawdata = open(data_dir + '/Questions.csv', 'rb').read()
# result = chardet.detect(rawdata)
# charenc = result['encoding']

KeyboardInterrupt: ignored
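
The detection attempt above reads the entire file and ended in a `KeyboardInterrupt`. A lighter alternative, sketched here with the standard library only, is to test a bounded sample of bytes against a short list of candidate encodings. The function name `first_working_encoding` and the candidate list are assumptions made for this illustration.

```python
def first_working_encoding(sample, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes `sample` without error."""
    for name in candidates:
        try:
            sample.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# Hypothetical byte sample; with a real file, read a bounded chunk instead:
#   with open(path, 'rb') as f:
#       sample = f.read(1_000_000)
sample = 'Garc\u00eda'.encode('latin-1')
guess = first_working_encoding(sample)
print(guess)
```

Two caveats: `latin-1` maps every possible byte, so it never fails and only works as a last resort, and a chunk cut from a real file can split a multi-byte UTF-8 character at its end, which would make `utf-8` fail even on a valid file.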

In [16]:
# for line in open('u.item', encoding = "ISO-8859-1"):
_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}
def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)

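def tags(data_dir):
    """Yield Tag instances from tag.txt."""
    # NOTE: `process_tag_record` is not defined anywhere in this notebook,
    # so iterating over tags() would raise a NameError.
    with open(data_dir, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)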

In [25]:
df = pd.read_csv(data_dir)

ParserError: ignored

#### None of the above solved the UnicodeDecodeError, so I searched YouTube and found the following [video](https://www.youtube.com/watch?v=0gjbunAe5ck).

Passing `engine='python'` tells Pandas to parse the file with its pure-Python parser instead of the default C parser. The Python engine is slower but more feature-complete.

I think it worked :D

In [33]:
# questions_df = pd.read_csv(data_dir + '/Questions.csv', delimiter='\t', encoding='UTF-16')
df = pd.read_csv(data_dir + '/Questions.csv', engine='python')
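
Another option worth trying is to pass an explicit `encoding` to `pd.read_csv`. The sketch below uses a tiny in-memory CSV, not the real `Questions.csv`, and assumes Latin-1 purely for illustration. I have not confirmed which encoding the real file uses.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, encoded as Latin-1 so that it contains the byte 0xed
csv_bytes = 'Id,Title\n1,Garc\u00eda\n2,plain ascii\n'.encode('latin-1')

# Tell Pandas how to decode the bytes instead of relying on the UTF-8 default
sample_df = pd.read_csv(io.BytesIO(csv_bytes), encoding='latin-1')
print(sample_df)
```

For the real file, the equivalent call would pass the file path in place of the `io.BytesIO` object.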

In [39]:
# The shape attribute (not a function) is a tuple: (number of rows, number of columns)
print(df.shape)

(607282, 6)


| | Id | OwnerUserId | CreationDate | Score | Title | Body |
|---|---|---|---|---|---|---|
| 0 | 469 | 147.0 | 2008-08-02T15:11:16Z | 21 | How can I find the full path to a font from it... | `<p>I am using the Photoshop's javascript API t...` |
| 1 | 502 | 147.0 | 2008-08-02T17:01:58Z | 27 | Get a preview JPEG of a PDF on Windows? | `<p>I have a cross-platform (Python) applicatio...` |
| 2 | 535 | 154.0 | 2008-08-02T18:43:54Z | 40 | Continuous Integration System for a Python Cod... | `<p>I'm starting work on a hobby project with a...` |
| 3 | 594 | 116.0 | 2008-08-03T01:15:08Z | 25 | cx_Oracle: How do I iterate over a result set? | `<p>There are several ways to iterate over a re...` |
| 4 | 683 | 199.0 | 2008-08-03T13:19:16Z | 28 | Using 'in' to match an attribute of Python obj... | `<p>I don't remember whether I was dreaming or ...` |
| ... | ... | ... | ... | ... | ... | ... |
| 607277 | 40143190 | 333403.0 | 2016-10-19T23:36:01Z | 1 | How to execute multiline python code from a ba... | `<p>I need to extend a shell script (bash). As ...` |
| 607278 | 40143228 | 6662462.0 | 2016-10-19T23:40:00Z | 0 | How to get google reCaptcha image source using... | `<p>I understood that reCaptcha loads a new fra...` |
| 607279 | 40143267 | 4064680.0 | 2016-10-19T23:44:07Z | 0 | Updating an ManyToMany field with Django rest | `<p>I'm trying to set up this API so I can use ...` |
| 607280 | 40143338 | 7044980.0 | 2016-10-19T23:52:27Z | 2 | Most possible pairs | `<p>Given a list of values, and information on ...` |


#### Task 2: Explore the rows & columns, ranges of values, etc.

In [36]:
# The Pandas info() function prints out a summary of the DataFrame
# DataFrame.info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607282 entries, 0 to 607281
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            607282 non-null  int64  
 1   OwnerUserId   601070 non-null  float64
 2   CreationDate  607282 non-null  object 
 3   Score         607282 non-null  int64  
 4   Title         607282 non-null  object 
 5   Body          607282 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 27.8+ MB
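
The summary above shows 601070 non-null values in `OwnerUserId` out of 607282 rows, which means 6212 values are missing. The missing values are also why the column is stored as `float64` rather than an integer type. Here is a short sketch of how to count missing values per column, using a small made-up frame (`toy`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing OwnerUserId
toy = pd.DataFrame({
    'Id': [469, 502, 535],
    'OwnerUserId': [147.0, np.nan, 154.0],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `df.isna().sum()` on the real DataFrame would give the same per-column count.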


In [37]:
# The Pandas describe() function returns descriptive statistics for the DataFrame's numeric columns
# DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

df.describe()

| | Id | OwnerUserId | Score |
|---|---|---|---|
| count | 607282.0 | 601070.0 | 607282.0 |
| mean | 23719600.0 | 2519595.0 | 2.283137 |
| std | 11247150.0 | 1910375.0 | 19.285578 |
| min | 469.0 | 25.0 | -44.0 |
| 25% | 14855190.0 | 853934.0 | 0.0 |
| 50% | 25318970.0 | 2107677.0 | 1.0 |
| 75% | 33588230.0 | 3991164.0 | 2.0 |
| max | 40143360.0 | 7044992.0 | 5524.0 |
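
`describe()` only covers the numeric columns here, and `CreationDate` is left out because it is stored as text (`object`). One possible next step, sketched on a small made-up frame (`toy`) using two timestamps from the preview above, is to parse that column with `pd.to_datetime` so the date range can be checked:

```python
import pandas as pd

# Hypothetical two-row frame with timestamps in the same format as CreationDate
toy = pd.DataFrame({
    'CreationDate': ['2008-08-02T15:11:16Z', '2016-10-19T23:52:27Z'],
})

# Convert the ISO 8601 strings into timezone-aware datetimes
toy['CreationDate'] = pd.to_datetime(toy['CreationDate'])

first, last = toy['CreationDate'].min(), toy['CreationDate'].max()
print(first, last)
```

The trailing `Z` marks the timestamps as UTC, so the parsed column is timezone-aware.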


#### Task 3: Handle missing, incorrect, and invalid data