 # <div style="text-align: center">Probability of Earthquake  </div> 
### <div style="text-align: center">Quite Practical and Far from any Theoretical Concepts
 
 </div> 
<img src='https://cdn-images-1.medium.com/max/800/1*ZqMOkymLG5oSuUuVAKSlpg.png' width=400 height=400>
<div style="text-align:center">last update: <b>11/01/2019</b></div>



You can fork the code and follow me on:

> ###### [ GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)
> ###### [Kaggle](https://www.kaggle.com/mjbahmani/)
-------------------------------------------------------------------------------------------------------------
 <b>I hope you find this kernel helpful and some <font color='red'>UPVOTES</font> would be very much appreciated.</b>
    
 -----------

 <a id="top"></a> <br>
## Notebook  Content
1. [Introduction](#1)
1. [Basic Concepts](#2)
1. [Conclusion](#3)
1. [References](#4)

 <a id="1"></a> <br>
## 1- Introduction
**Forecasting earthquakes** is one of the most important problems in Earth science. Because the timing of an earthquake is uncertain, forecasting is naturally tied to the concepts of **probability**. In this kernel, I look at earthquake prediction with the **help** of those concepts.
<img src='https://www.livemint.com/r/LiveMint/WebArchive/BP/Photos/2015-06-16/Processed/Mint/Web/earthquake-1.jpg'>
For anyone taking their first steps in data science, **probability** is a must-know concept. Probability theory is the backbone of many important ideas in data science, from inferential statistics to Bayesian networks. It would not be wrong to say that the journey of mastering statistics begins with probability.

 <a id="21"></a> <br>
## 2-1 Import

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
import matplotlib.patches as patch
import matplotlib.pyplot as plt
from sklearn.svm import NuSVR
from scipy.stats import norm
from scipy import linalg
from sklearn import svm
import tensorflow as tf
from tqdm import tqdm
import seaborn as sns
import pandas as pd
import numpy as np
import glob
import sys
import os

 <a id="22"></a> <br>
##  2-2 Setup

In [None]:
%matplotlib inline
%precision 4
plt.style.use('ggplot')
np.set_printoptions(suppress=True)
pd.set_option("display.precision", 15)

 <a id="23"></a> <br>
## 2-3 Version


In [None]:
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

<a id="3"></a> 
<br>
## 3- Problem Definition
I think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (**Problem Formalization**).

Problem definition has four steps, illustrated in the picture below:
<img src="http://s8.picofile.com/file/8338227734/ProblemDefination.png">

**Current scientific studies related to earthquake forecasting focus on three key points:** 
1. when the event will occur
1. where it will occur
1. how large it will be.


### 3-1 Problem Feature

1.     train.csv - A single, continuous training segment of experimental data.
1.     test - A folder containing many small segments of test data.
1.     sample_submission.csv - A sample submission file in the correct format.


### 3-2 Aim
In this competition, you will address <font color='red'><b>WHEN</b></font> the earthquake will take place.

### 3-3 Variables

1.     acoustic_data - the seismic signal [int16]
1.     time_to_failure - the time (in seconds) until the next laboratory earthquake [float64]
1.     seg_id - the test segment ids for which predictions should be made (one prediction per segment)


### 3-4 Evaluation
Submissions are evaluated using the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) between the predicted time remaining before the next lab earthquake and the actual remaining time.
<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/3ef87b78a9af65e308cf4aa9acf6f203efbdeded'>
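
As a rough illustration of the metric, the sketch below computes the mean absolute error by hand with NumPy. The `y_true` and `y_pred` values are made up for the example and do not come from the competition data; `mean_absolute_error` from `sklearn.metrics`, imported earlier in this kernel, should give the same number.

```python
import numpy as np

# Hypothetical true and predicted times to failure, in seconds (illustrative only)
y_true = np.array([1.0, 2.5, 4.0, 0.5])
y_pred = np.array([1.5, 2.0, 3.0, 0.5])

# MAE is the average of the absolute differences |y_true - y_pred|
abs_errors = np.abs(y_true - y_pred)
mae = abs_errors.mean()

print(abs_errors)
print(mae)
```

A lower value means the predictions are, on average, closer to the actual remaining time.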

## 4- Exploratory Data Analysis (EDA)
In this section, we'll use graphical and numerical techniques to begin uncovering the structure of the data.
 
* Which variables suggest interesting relationships?
* Which observations are unusual?
* How are the features distributed?

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. We will then review the following analytical and statistical operations:

*  Data Collection
*  Visualization
*  Data Preprocessing
*  Data Cleaning

 <a id="14"></a> <br>
## 4-1 Data Collection

In [None]:
print(os.listdir("../input/"))

In [None]:
# Load the first 1,000 rows of the training data to play with
train = pd.read_csv("../input/train.csv", nrows=1000)

In [None]:
sample_submission = pd.read_csv('../input/sample_submission.csv')
sample_submission.head()

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
train.shape

In [None]:
train.isna().sum()

In [None]:
type(train)

 <a id="15"></a> <br>
## 4-2 Visualization

In [None]:
train["acoustic_data"].hist();

In [None]:
pd.plotting.scatter_matrix(train, figsize=(10, 10))
plt.show()

In [None]:
sns.jointplot(x='acoustic_data',y='time_to_failure' ,data=train, kind='reg')

In [None]:
sns.swarmplot(x='time_to_failure',y='acoustic_data',data=train)

In [None]:

plt.plot(train["time_to_failure"], train["acoustic_data"])
plt.xlabel("time_to_failure")
plt.ylabel("acoustic_data")
plt.title("acoustic_data vs. time_to_failure")

In [None]:
sns.distplot(train["acoustic_data"])

 <a id="1"></a> <br>
## 2- What is probability?

At the most basic level, probability seeks to answer the question, "What is the chance of an event happening?" An event is some outcome of interest. To calculate the chance of an event happening, we also need to consider all the other events that can occur.

The quintessential representation of probability is the humble coin toss. In a coin toss the only events that can happen are:

    * Flipping a heads
    * Flipping a tails

These two events form the sample space, the set of all possible outcomes. To calculate the probability of an event when all outcomes are equally likely, we count the number of ways the event of interest can occur (say, flipping heads) and divide it by the total number of outcomes in the sample space. Thus, probability tells us that an ideal coin has a 1-in-2 chance of landing heads and a 1-in-2 chance of landing tails. By looking at the events that can occur, probability gives us a framework for making predictions about how often events will happen.
<img src='https://i.imgur.com/GtbawRt.jpg'>
However, even though it seems obvious, if we actually toss some coins, we're likely to get an abnormally high or low count of heads every once in a while. If we don't want to assume that the coin is fair, what can we do? We can gather data! We can use statistics to estimate probabilities from real-world observations and check how they compare to the ideal.

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
train.acoustic_data.describe()

In [None]:
train.shape

In [None]:
train.isna().sum()

In [None]:
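import random

def coin_trial():
    """Toss a fair coin 100 times and return the number of heads."""
    heads = 0
    for _ in range(100):
        if random.random() < 0.5:
            heads += 1
    return heads

def simulate(n):
    """Run n trials and return the average number of heads per trial."""
    trials = [coin_trial() for _ in range(n)]
    return sum(trials) / n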

In [None]:
simulate(10)

## 2-1 Why do we need probability?
In an uncertain world, it can be of immense help to know and understand chances of various events. You can plan things accordingly. If it’s likely to rain, I would carry my umbrella. If I am likely to have diabetes on the basis of my food habits, I would get myself tested. If my customer is unlikely to pay me a renewal premium without a reminder, I would remind him about it.

* So knowing the likelihood might be very beneficial.

## 2-2 Random Variables
To calculate the likelihood of an event, we need a framework for expressing outcomes as numbers. We can do this by mapping each outcome of an experiment to a number.

Let’s define X to be the outcome of a coin toss.

1. X = outcome of a coin toss

Possible outcomes:

1. 1 if heads
1. 0 if tails

Let’s take another one.

Suppose I win the game if I get a sum of 8 when rolling two fair dice. I can define my random variable Y to be the sum of the upward faces of the two dice.

1. Y can take values = (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
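
To make the dice example concrete, here is a minimal sketch that enumerates the sample space of two fair dice and builds the distribution of Y. The names `outcomes`, `counts` and `pmf_Y` are introduced only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# The random variable Y maps each outcome to the sum of the two faces
counts = Counter(first + second for first, second in outcomes)

# Probability of each value of Y = (number of outcomes giving that sum) / 36
pmf_Y = {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}

for total, prob in pmf_Y.items():
    print(total, prob)

# Probability of winning the game (a sum of 8)
p_win = pmf_Y[8]
print("P(Y = 8) =", p_win)
```

Using `Fraction` keeps the probabilities exact, which makes it easy to check that they sum to 1.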

In [None]:
# Sample Space
cards = 52

# Outcomes
aces = 4

# Divide possible outcomes by the sample set
ace_probability = aces / cards

# Print probability rounded to two decimal places
print(round(ace_probability, 2))

## 3- Binomial Distribution

In [None]:
n, p = 10, .5  # number of trials, probability of success on each trial

In [None]:
s = np.random.binomial(n, p, 1000)

In [None]:
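from scipy.stats import binom
import seaborn as sb

# Draw 10 samples from Binomial(n=20, p=0.8); each sample is a success count between 0 and 20
binom.rvs(size=10,n=20,p=0.8)

# Draw 1,000 samples and plot their distribution
data_binom = binom.rvs(n=20,p=0.8,loc=0,size=1000)
ax = sb.distplot(data_binom,
                  kde=True,
                  color='blue',
                  hist_kws={"linewidth": 25,'alpha':1})
ax.set(xlabel='Binomial', ylabel='Frequency')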
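
The cells above only draw samples. As a small complementary sketch, the probability of exactly k successes in n independent trials can be written as C(n, k) · p^k · (1 − p)^(n − k). The code below evaluates that expression directly and compares it with `scipy.stats.binom.pmf`; the choice of `k = 3` is arbitrary and only for illustration.

```python
from math import comb
from scipy.stats import binom

n, p = 10, 0.5   # number of trials, probability of success on each trial
k = 3            # number of successes we ask about (arbitrary example value)

# Binomial probability mass function written out by hand
pmf_formula = comb(n, k) * p**k * (1 - p)**(n - k)

# The same quantity from scipy
pmf_scipy = binom.pmf(k, n, p)

print(pmf_formula, pmf_scipy)
```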

## 4- Continuous random variables

In [None]:
from scipy.stats import rv_continuous

In [None]:
class gaussian_gen(rv_continuous):
    "Gaussian distribution"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)
gaussian = gaussian_gen(name='gaussian')
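
The class above is defined but not used. A possible way to try it out is sketched below: once `_pdf` is supplied, `rv_continuous` derives the other methods (such as `cdf`) numerically. The class is repeated so that the snippet runs on its own; the variable names `pdf_at_zero`, `cdf_at_zero` and `prob_within_1sd` are only for this illustration.

```python
import numpy as np
from scipy.stats import norm, rv_continuous

class gaussian_gen(rv_continuous):
    "Gaussian distribution"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)

gaussian = gaussian_gen(name='gaussian')

# Density at the mean
pdf_at_zero = gaussian.pdf(0.0)

# For a continuous variable, probabilities come from areas under the density:
# P(X <= 0) and P(-1 <= X <= 1), obtained by numerical integration of _pdf
cdf_at_zero = gaussian.cdf(0.0)
prob_within_1sd = gaussian.cdf(1.0) - gaussian.cdf(-1.0)

print(pdf_at_zero, cdf_at_zero, prob_within_1sd)

# Cross-check against scipy's built-in standard normal
print(norm.pdf(0.0), norm.cdf(0.0), norm.cdf(1.0) - norm.cdf(-1.0))
```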

## 5- Poisson Distribution
A Poisson distribution describes the number of times an event is likely to occur within a fixed period of time. It is used for independent events that occur at a constant average rate within a given interval. The Poisson distribution is discrete: it counts occurrences, so the variable can only take whole-number values (0, 1, 2, ...).

We use scipy to draw samples from the Poisson distribution and the seaborn Python library, which has built-in functions for plotting probability distributions, to graph them.

In [None]:
from scipy.stats import poisson
import seaborn as sb

# Draw 10,000 samples from a Poisson distribution with mean mu=4
data_poisson = poisson.rvs(mu=4, size=10000)
ax = sb.distplot(data_poisson,
                  kde=True,
                  color='green',
                  hist_kws={"linewidth": 25,'alpha':1})
ax.set(xlabel='Poisson', ylabel='Frequency')
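
For reference, the probability of observing exactly k events when the average rate is mu can be written as e^(−mu) · mu^k / k!. The sketch below evaluates this by hand and compares it with `scipy.stats.poisson.pmf`, using the same `mu = 4` as the sampling cell above. The helper `poisson_pmf` is a name made up for this example.

```python
from math import exp, factorial
from scipy.stats import poisson

mu = 4  # average number of events per interval

def poisson_pmf(k, mu):
    """Poisson probability of exactly k events (illustrative helper)."""
    return exp(-mu) * mu**k / factorial(k)

# Compare the hand-written formula with scipy for the first few counts
for k in range(6):
    print(k, poisson_pmf(k, mu), poisson.pmf(k, mu))

# The probabilities over all whole-number counts should add up to (almost exactly) 1
total = sum(poisson_pmf(k, mu) for k in range(60))
print(total)
```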

## 6- Bernoulli Distribution
The Bernoulli distribution is a special case of the binomial distribution in which a single experiment is conducted, so the number of trials is 1. The Bernoulli distribution therefore describes events having exactly two outcomes.

We use scipy to draw samples from a Bernoulli distribution, then plot a histogram of the samples with the estimated density curve over it.

In [None]:
from scipy.stats import bernoulli
import seaborn as sb

data_bern = bernoulli.rvs(size=1000,p=0.6)
ax = sb.distplot(data_bern,
                  kde=True,
                  color='crimson',
                  hist_kws={"linewidth": 25,'alpha':1})
ax.set(xlabel='Bernoulli', ylabel='Frequency')
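
The claim that a Bernoulli variable is a binomial variable with a single trial can be checked numerically. This is a minimal sketch using `scipy.stats`, with `p = 0.6` taken from the cell above; `bern_probs` and `binom_probs` are names introduced only for the example.

```python
from scipy.stats import bernoulli, binom

p = 0.6  # probability of success

# Probability of each of the two outcomes (0 = failure, 1 = success)
bern_probs = [bernoulli.pmf(k, p) for k in (0, 1)]

# The same probabilities from a binomial distribution with n = 1 trial
binom_probs = [binom.pmf(k, 1, p) for k in (0, 1)]

print(bern_probs)
print(binom_probs)

# Mean and variance of a Bernoulli variable are p and p * (1 - p)
print(bernoulli.mean(p), bernoulli.var(p))
```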

##  7- Z Scores
We will encounter many cases where we need to know the probability that the data is less than or more than a particular value, and this value will not always be exactly 1σ or 2σ away from the mean.

The distance between the observed value and the mean, measured in number of standard deviations, is the standard score, or Z score.

In [None]:
from scipy import stats

a = np.array([ 0.7972,  0.0767,  0.4383,  0.7866,  0.8091,
            0.1954,  0.6307,  0.6599,  0.1065,  0.0508])
# Standardize: subtract the mean and divide by the standard deviation
stats.zscore(a)
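
As a sketch of what `stats.zscore` computes, the snippet below reproduces it by hand and then turns a Z score into a probability with the standard normal distribution. The value `z = 1.5` is an arbitrary example, and reading the result as a probability assumes the data are roughly normally distributed.

```python
import numpy as np
from scipy import stats

a = np.array([0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# Z score by hand: distance from the mean in units of standard deviation
# (np.std uses the population standard deviation, ddof=0, like stats.zscore by default)
z_manual = (a - a.mean()) / a.std()
z_scipy = stats.zscore(a)
print(np.allclose(z_manual, z_scipy))

# Probability that a standard normal value falls below / above z = 1.5
z = 1.5
p_below = stats.norm.cdf(z)
p_above = stats.norm.sf(z)
print(p_below, p_above)
```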

## 8- P-Value
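The p-value measures the strength of evidence against a null hypothesis: it is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. We state a hypothesis based on some statistical model and use the p-value to judge whether the data are consistent with it. One way to get a p-value is a t-test.

The one-sample t-test used here (`stats.ttest_1samp`) is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations ‘a’ is equal to the given population mean, popmean. Let us consider the following example.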

In [None]:
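from scipy import stats
# Two independent samples (columns) of 50 draws each from a normal distribution with mean 5 and sd 10
rvs = stats.norm.rvs(loc=5, scale=10, size=(50, 2))
# Test each column against a population mean of 5.0; one statistic and one p-value per column
print(stats.ttest_1samp(rvs, 5.0))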
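
To show where the numbers returned by `ttest_1samp` come from, here is a rough sketch that computes the t statistic and the two-sided p-value by hand for a single sample. The seed and the names `sample`, `t_manual` and `p_manual` are choices made for this illustration only.

```python
import numpy as np
from scipy import stats

# One sample of 50 draws from a normal distribution with mean 5 and sd 10 (seeded for repeatability)
rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=10, size=50)
popmean = 5.0
n = sample.size

# t statistic: difference between sample mean and popmean, in units of the standard error
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

A large p-value means the sample gives no strong reason to reject the hypothesis that the true mean equals `popmean`.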

## 9- What is Conditional Probability?

Conditional probability is a measure of the probability of an event (some particular situation occurring) given that (by assumption, presumption, assertion or evidence) another event has occurred.
<img src='https://cdn-images-1.medium.com/max/800/0*A_FcopnqXGd_bVVn.gif'>

The probability of event B given event A equals the probability of event A and event B divided by the probability of event A.
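
The rule above, P(B | A) = P(A and B) / P(A), can be checked by counting outcomes. The sketch below reuses the two-dice setting from the random variables section; the events A ("the first die is even") and B ("the sum is 8") are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# Event A: the first die shows an even number
A = {o for o in outcomes if o[0] % 2 == 0}
# Event B: the two dice sum to 8
B = {o for o in outcomes if sum(o) == 8}

p_A = Fraction(len(A), len(outcomes))
p_B = Fraction(len(B), len(outcomes))
p_A_and_B = Fraction(len(A & B), len(outcomes))

# Conditional probability of B given A
p_B_given_A = p_A_and_B / p_A

print("P(B)     =", p_B)
print("P(B | A) =", p_B_given_A)
```

Knowing that A occurred changes the probability of B, which is exactly what conditioning is meant to capture.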

 

<a id="33"></a> <br>
## 10- References
1. [Basic Probability Data Science with examples](https://www.analyticsvidhya.com/blog/2017/02/basic-probability-data-science-with-examples/)
1. [How to self learn statistics of data science](https://medium.com/ml-research-lab/how-to-self-learn-statistics-of-data-science-c05db1f7cfc3)
1. [Probability statistics for data science- series](https://towardsdatascience.com/probability-statistics-for-data-science-series-83b94353ca48)
1. [basic-statistics-in-python-probability](https://www.dataquest.io/blog/basic-statistics-in-python-probability/)
1. [tutorialspoint](https://www.tutorialspoint.com/python/python_poisson_distribution.htm)

Go to first step: [Course Home Page](https://www.kaggle.com/mjbahmani/10-steps-to-become-a-data-scientist)

Go to next step : [Titanic](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)


# Not Completed yet!!!