## Introduction

In this competition, you’re challenged to build predictive algorithms for different subjective aspects of question-answering. The question-answer pairs were gathered from nearly 70 different websites, in a "common-sense" fashion. Our raters received minimal guidance and training, and relied largely on their subjective interpretation of the prompts. As such, each prompt was crafted in the most intuitive fashion so that raters could simply use their common-sense to complete the task. By lessening our dependency on complicated and opaque rating guidelines, we hope to increase the re-use value of this data set. What you see is what you get!

In [None]:
# Loading packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import missingno as msno
from wordcloud import WordCloud

# Enable offline plotly rendering inside the notebook (one call is enough)
import plotly.offline as py
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reading data

train = pd.read_csv("../input/google-quest-challenge/train.csv")
Sample = pd.read_csv("../input/google-quest-challenge/sample_submission.csv")

## Statistical Analysis

In [None]:
# Print first few rows of train data

train.head()

In [None]:
# Shape of train data

train.shape

In [None]:
# Some basic info of train data

train.info()

In [None]:
# Describe train data

train.describe()

In [None]:
# List the column names (skipping the first column)

list(train.columns[1:])

### Check null/NaN values

In [None]:
msno.matrix(train)

The matrix shows no gaps, which means there are no missing values in our data.
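The missingno matrix is a visual check; a numeric count per column can confirm it. The sketch below is a minimal example, assuming a hypothetical helper `missing_summary` and a tiny stand-in frame `demo` (neither is part of the original kernel). On the real data you would call `missing_summary(train)`.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Stand-in data for illustration only; values are made up
demo = pd.DataFrame({
    "question_title": ["a", None, "c"],
    "answer": ["x", "y", "z"],
})

print(missing_summary(demo))
```

An empty result means the frame has no missing values at all.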

**Our dataset has two types of features. Let's see them:**

* **Categorical features**

In [None]:
train.select_dtypes(include = ['object']).columns.values

* **Numerical features**

In [None]:
train.select_dtypes(include = ['float64', 'int64']).columns.values

## Data Exploration

### Question Title

In [None]:
train['question_title'].value_counts().head(30)

Many question titles appear more than once in our train data. Let's check the number of unique questions.

In [None]:
len(train['question_title'].unique())

We have 3583 unique question titles.
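To put the unique count in context, it helps to compare it with the total number of rows and the number of repeated rows. This is a minimal sketch, assuming a hypothetical helper `title_duplication` and a made-up `demo` frame; on the real data you would pass `train`.

```python
import pandas as pd

def title_duplication(df: pd.DataFrame, col: str = "question_title") -> dict:
    """Summarise how often the values in `col` repeat."""
    return {
        "rows": len(df),
        "unique": df[col].nunique(),
        # Rows whose value already appeared earlier in the frame
        "repeated_rows": int(df[col].duplicated().sum()),
    }

# Stand-in data for illustration only
demo = pd.DataFrame({"question_title": ["How to X?", "How to X?", "Why Y?"]})

print(title_duplication(demo))
```

Repeated titles matter when splitting the data for validation: the same question can otherwise end up in both the training and validation folds.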

### WordCloud

* Question Title
* Question Body
* Answer

Let's check which words are used most.

In [None]:
# Question Title

wordcloud = WordCloud(width = 1000, height = 600, max_font_size = 200, max_words = 150, 
                      background_color='white').generate(" ".join(train.question_title))

plt.figure(figsize=[10,10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Question Body

wordcloud = WordCloud(width = 1000, height = 600, max_font_size = 200, max_words = 150, 
                      background_color='white').generate(" ".join(train.question_body))

plt.figure(figsize=[10,10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Answer

wordcloud = WordCloud(width = 1000, height = 600, max_font_size = 200, max_words = 150,
                      background_color='white').generate(" ".join(train.answer))

plt.figure(figsize=[10,10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

* **Question Title**

  > using, Window, function, user, time, file, use, value, change, one
   
   
* **Question Body**
   
  > gt, lt, using, use, one, will, know, new, user
   
   
* **Answer**
 
  > gt, lt, use, one, using, need, will, time, way, file


Most of the top words are shared across all three WordClouds.
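The tokens `gt` and `lt` most likely come from HTML-escaped `&gt;` and `&lt;` in the raw text rather than from real words. The sketch below decodes such entities before counting, using only the standard library. `top_words` and the sample strings are hypothetical and not part of the original kernel.

```python
import html
import re
from collections import Counter

def top_words(texts, n=10, stopwords=frozenset()):
    """Count the most common lowercase word tokens after decoding HTML entities."""
    counts = Counter()
    for text in texts:
        decoded = html.unescape(text)  # "&gt;" becomes ">", "&lt;" becomes "<"
        tokens = re.findall(r"[a-z']+", decoded.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Stand-in strings for illustration only
demo = ["if a &gt; b use a", "use b when a &lt; b"]

print(top_words(demo, n=3))
```

On the real data, something like `top_words(train.question_body)` would give a ranked list to compare against the WordCloud, and the same decoded text could be passed to `WordCloud.generate`.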

### Category

Let's see the categories:

In [None]:
Cat = train['category'].value_counts()

fig = go.Figure([go.Bar(x=Cat.index, y=Cat)])
fig.update_layout(title = "Count of categories")
py.iplot(fig, filename='test')

We have 5 categories in our data. Technology is the most frequent among them.

### Host

The data includes questions and answers from various StackExchange properties.

In [None]:
Host = train['host'].value_counts()

fig = go.Figure(data = [go.Scatter(x = Host.index, y = Host.values)])
fig.update_layout(title = "Distribution of Host")
py.iplot(fig, filename='test')

The most common host is stackoverflow.com.
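A plot shows the ranking, but a table of counts and shares makes the size of the gap explicit. This is a minimal sketch, assuming a hypothetical helper `value_shares` and made-up sample values; on the real data you would call `value_shares(train['host'])` or `value_shares(train['category'])`.

```python
import pandas as pd

def value_shares(series: pd.Series, top: int = 5) -> pd.DataFrame:
    """Return the `top` most frequent values with their counts and share of all rows."""
    counts = series.value_counts()  # sorted from most to least frequent
    table = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    return table.head(top)

# Stand-in data for illustration only
demo = pd.Series(["stackoverflow.com"] * 3 + ["english.stackexchange.com"])

print(value_shares(demo))
```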

We have 30 target variables. Let's see them:

In [None]:
targetCol = list(Sample.columns[1:])

In [None]:
train[targetCol].values

We can clearly see that our target variables are not binary; they are continuous values in the range 0 to 1.
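Printing the raw array only shows a few values, so a per-column summary is a more reliable way to check the range. The sketch below assumes a hypothetical helper `target_summary`; the column names are taken from the notebook, but the values in `demo` are made up. On the real data you would call `target_summary(train, targetCol)`.

```python
import pandas as pd

def target_summary(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Report min, max and number of distinct values for each target column."""
    summary = pd.DataFrame({
        "min": df[cols].min(),
        "max": df[cols].max(),
        "n_unique": df[cols].nunique(),
    })
    # True when every value of the column lies inside [0, 1]
    summary["in_unit_interval"] = (summary["min"] >= 0) & (summary["max"] <= 1)
    return summary

# Stand-in data for illustration only
demo = pd.DataFrame({
    "question_fact_seeking": [0.0, 0.5, 1.0, 0.5],
    "question_opinion_seeking": [1.0, 0.5, 0.0, 0.5],
})

print(target_summary(demo, list(demo.columns)))
```

A small `n_unique` would suggest that a target takes only a handful of discrete levels even though it is stored as a float.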

Let's see how they are correlated.

In [None]:
corr = train[targetCol].corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in recent NumPy versions
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(15, 14))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,square=True, linewidths=.5, cbar_kws={"shrink": .5})

The heatmap shows which target pairs are correlated, such as 'question_type_instructions' and 'answer_type_instructions'. Let's see how:

In [None]:
sns.relplot(x="question_type_instructions", y="answer_type_instructions", data=train)

In [None]:
sns.relplot(y="question_opinion_seeking", x="question_fact_seeking", data=train)

It seems **question_opinion_seeking** and **question_fact_seeking** are **correlated** with each other.
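Reading pairs off a 30x30 heatmap by eye is error-prone, so it can help to rank the pairs by absolute correlation. This is a minimal sketch, assuming a hypothetical helper `top_correlated_pairs`; the column names come from the notebook, but the values in `demo` are made up. On the real data you would call `top_correlated_pairs(train, targetCol)`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, cols, n: int = 5, method: str = "pearson") -> pd.DataFrame:
    """Return the `n` column pairs with the largest absolute correlation."""
    cols = list(cols)
    corr = df[cols].corr(method=method).to_numpy()
    # Indices of the strict upper triangle, so each pair is listed once
    rows, cols_idx = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "a": [cols[r] for r in rows],
        "b": [cols[c] for c in cols_idx],
        "corr": corr[rows, cols_idx],
    })
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n).reset_index(drop=True)

# Stand-in data for illustration only
demo = pd.DataFrame({
    "question_fact_seeking": [0.0, 0.5, 1.0, 0.5],
    "question_opinion_seeking": [1.0, 0.5, 0.0, 0.5],
    "answer_type_instructions": [0.0, 1.0, 0.0, 1.0],
})

print(top_correlated_pairs(demo, demo.columns, n=3))
```

The sign of `corr` shows the direction of each relationship, which a scatter plot of heavily overlapping points can hide.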

I will update my kernel. 

Stay tuned for more!!

**Please UPVOTE if you find it useful or leave a comment if you have any queries.**