# The Notebook

After graduating from a health-related course, I began thinking about what I wanted to do for my career. Early on, I realized that I no longer saw myself on the expected path of becoming a clinician. What followed were several years of searching for my passion. Finally, two years ago, I became involved in research, and there I was introduced to data. I was thrilled when I got the chance to analyze the research findings. This drove me to learn more about analyzing data, which led me to data science. Unfortunately, I did not know how to start learning data science. I felt out of my depth, since data science was far removed from the course I had finished. I applied to several data science programs but was not lucky enough to be accepted. So I decided to start the journey by myself, gathering information from free courses, blogs, and free books. This is my notebook.

## Contents
- [Libraries](#Libraries)
- [Coding](#Coding)
- [Data Analytics Framework](#DataAnalyticsFramework)
- [GITBasics](#GITBasics)
- [Python Basics](#PythonBasics)

## Libraries
[back](#Contents)
<br> The following are the most common libraries used in data science.

### [NumPy](https://numpy.org/devdocs/user/index.html)

NumPy is the universal standard for working with numerical data in Python. It is used extensively in Pandas, SciPy, Matplotlib, and scikit-learn. The library provides multidimensional array and matrix data structures.

In [None]:
import numpy as np

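### [SciPy](https://docs.scipy.org/doc/scipy/reference/)

SciPy is a general-purpose scientific computing library that builds on NumPy and provides advanced solvers (for example, for optimization, integration, and linear algebra).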

In [1]:
import scipy

### [Pandas](https://pandas.pydata.org/docs/getting_started/tutorials.html)

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It has two primary data structures, Series and DataFrame.

In [None]:
import pandas as pd

### [Matplotlib](https://matplotlib.org/3.1.1/contents.html)
#### Pyplot

`matplotlib.pyplot` is a collection of functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In [None]:
import matplotlib.pyplot as plt

### [Seaborn](https://seaborn.pydata.org/)

Seaborn aims to make visualization a central part of exploring and
understanding data. Its dataset-oriented plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the
necessary semantic mapping and statistical aggregation to produce informative
plots.

In [None]:
import seaborn as sns
sns.set()

### [Bokeh](https://docs.bokeh.org/en/latest/docs/user_guide/quickstart.html)

Bokeh is a library for interactive data visualization. It renders its graphics using HTML and JavaScript, which makes it a candidate for building web-based dashboards and applications.

In [None]:
from bokeh.plotting import figure, output_file, show
#this is just an example

### [Scikit-learn](https://scikit-learn.org/stable/)

Scikit-learn is a Python library that provides many unsupervised and supervised learning algorithms.

In [None]:
from sklearn.ensemble import RandomForestClassifier
#this is just an example

### [Natural Language Toolkit](https://www.guru99.com/nltk-tutorial.html)

The Natural Language Toolkit (NLTK) is one of the most powerful NLP libraries. It contains packages that help machines process and understand human language and reply to it with an appropriate response.

In [None]:
from nltk.tokenize import RegexpTokenizer
#this is just an example

### [GeoPandas](https://geopandas.org/)

GeoPandas is an open source project to make working with geospatial data in Python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona for file access and on descartes and matplotlib for plotting.

Dependencies:
- numpy
- pandas
- shapely
- fiona
- pyproj
- six

In [None]:
import geopandas

## Coding
[back](#Contents)

This section provides links to guides and standards on how to write proper, readable, and well-documented code.

### [Shortcut](https://yoursdata.net/jupyter-lab-shortcut-and-magic-functions-tips/)

There are several keyboard shortcuts and magic functions that can be used to make life easier.

### [Markdown](https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed)

In Jupyter notebooks, Markdown is a very useful way to describe your code. With it, you can add text, images, tables, and videos, and create hyperlinks to places within the document as well as outside it.

### [Style Guide](https://www.python.org/dev/peps/pep-0008/)

PEP 8 is a document that gives coding conventions for Python code. It guides programmers on how to format their code, but it also notes that project-specific style guidelines that differ from PEP 8 take precedence within that project.

## DataAnalyticsFramework
[back](#Contents)

The [Analytics Association of the Philippines](https://aap.ph/) has created a framework to guide the conversation about data analytics by defining key terminology.

#### Data Analytics Skill

|Skill|Definition|Job|Level 0|Level 1|Level 2|Level 3|
|-----|----------|---|-------|-------|-------|-------|
|**Business and Organizational Skills**|||||||
|Domain Knowledge and application|apply domain related knowledge and insights to contextualize data|Functional Analysts|No skill|Understand collected data, and how they are handled and applied in the specific industry domain.|Develop content strategy and information architecture to support a given industry domain|Make business cases to improve domain-related procedures through data-driven decision-making|
|Data Management and governance|develop and implement data management strategies, incorporating privacy and data security, policies and regulations, and ethical considerations.|Data Stewards|No skill|aware of and always apply policies and measures to ensure data security, privacy, intellectual property, and ethics.|enforce policies and procedures for data security, privacy, intellectual property, and ethics.|develop policies on data security, privacy, intellectual property, and ethics.|
|Operational Analytics|use general and specialized analytics techniques for the investigation of all relevant data to derive insight for decision-making.|Analytics Managers|No skill|perform business analysis for specified tasks and data sets.|You identify business impact from trends and patterns.|You identify new opportunities to use historical data for organizational processes optimization.|
|Data Visualization and presentation|create and communicate compelling and actionable insights from data using visualization and presentation tools and technologies.||No skill|You prepare data visualization reports or narratives based on provided specifications|You create infographics for effective presentation and communication of actionable outcomes.|You select appropriate and develop new visualization methods used in a specific industry.|
|**Technical skills**|||||||
|Research methods|strategies, processes or techniques utilized in the collection of data or evidence for analysis in order to uncover new information or create better understanding of a topic.|Data Scientists|No skill|You understand and use the 4-step research model: hypothesis, research methods, artifact, evaluation.|You develop research questions around identified issues within existing research or business process models.|You design experiments which include data collection (passive and active) for hypothesis testing and problem solving.|
|Data engineering principles|They are the ones who will bring all the needed data from the various sources, extract, clean, aggregate, transform, and finally load them to the identified data repositories|Data Engineers|No skill|You have knowledge and ability to program selected SQL and NoSQL platform for data storage and access, in particular write ETL scripts.|You design and build relational and non-relational databases, ensure effective ETL processes for large datasets.|You have advanced knowledge and experience of using modern Big Data technologies to process different data types from multiple sources.|
|Statistical Techniques|Here, mathematical formulas are used in the analysis of raw research data. The application of these techniques extracts information from research data and provides different ways to assess the robustness of research outputs.|Data Scientists|No skill|You know and use statistical methods such as sampling, ANOVA, hypothesis testing, descriptive statistics, regression analysis, and others.|You select and recommend appropriate statistical methods and tools for specific tasks and data.|You identify problems with collected data and suggest corrective measures, including additional data collection, inspection, and pre-processing.|
|Data Analytics Methods and Algorithms|implement and evaluate machine learning methods and algorithms on the data to derive insights for decision-making.|Data Scientists|No skill|You demonstrate understanding and perform statistical hypothesis testing; you can explain statistical significance of collected data.|You apply quantitative techniques (e.g., time series analysis, optimization, simulation) to deploy appropriate models for analysis and prediction.|You assess data on reliability and appropriateness; you select appropriate approaches and their impact on analysis and the quality of the results.|
|Computing|apply information technology, computational thinking, and utilize programming languages and software and hardware solutions for data analysis.|Data Engineers, Data Scientists|No skill|You perform basic data manipulation, analysis, and visualization.|You apply computational thinking to transform formal data models and process algorithms into program code.|You select appropriate application and statistical programming languages, and development platforms for specific processes and data sets.|
|**21st Century Skills**|
|Critical Thinking|Demonstrating the ability to apply critical thinking skills to solve problems and make effective decisions|
|Communication|Understanding and communicating ideas|
|Collaboration|Working with others|
|Creativity and Attitude|Deliver high quality work and focus on final result, initiative, intellectual risk|
|Planning and Organizing|Planning and prioritizing work to manage time effectively and accomplish tasks|
|Business fundamentals|Having fundamental knowledge of the organization and the industry|
|Customer Focus|Actively look for ways to identify market demands and meet client needs|
|Working with Tools and Technology|Selecting, using and maintaining tools and technology to facilitate work activity|
|Dynamic (self-) re-skilling|ability to adapt to change|
|Professional network|involvement in professional network activities|
|Ethics|ethics in the use of technology, biased data collection and presentation|

## [GITBasics](https://geo-python.github.io/site/lessons/L2/git-basics.html)
[back](#Contents)

Git is a distributed version-control system for tracking changes in source code during software development.

## PythonBasics
[back](#Contents)

### Basic operations

In [4]:
# Numbers
10 + 4       # add
10 - 4       # subtract
10 * 4       # multiply
10 ** 4      # exponent
10 / 4       # divide
5 % 4        # modulo
10 // 4      # floor division
# Boolean operations
5>4
5>=3
5!=3
5 == 5
5 > 3 and 6 > 3
5 > 3 or 5 <3
not False
False or not False and True # evaluation order: not, and, or

True

### [Functions](https://docs.python.org/3/tutorial/controlflow.html)

Functions are sets of instructions that run when called. They can take multiple input values and return a value. A function usually has a [docstring](https://www.python.org/dev/peps/pep-0257/).

In [1]:
def double(x):
    '''This is an example of what a function looks like.
    The function starts with def, then the name of the function.
    The inputs are enclosed in the parentheses;
    there can be more than one input. After that, enclosed in three
    quotation marks, is a docstring that describes the function.
    Next is the body of the function and, lastly, the output you want.'''
    return x * 2
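
As a small sketch of the "multiple input values" point, the hypothetical function `scale` below (it is not part of the original notes) takes two inputs, one of which has a default value, and returns a single result.

```python
def scale(x, factor=2):
    """Return x multiplied by factor (factor defaults to 2)."""
    return x * factor


doubled = scale(5)            # only x is given, so the default factor is used
tripled = scale(5, factor=3)  # the default is overridden by keyword
print(doubled, tripled)
```

Calling a function with keyword arguments, as in the second call, tends to make the intent clearer when there are several inputs.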

### [Strings](https://docs.python.org/3/library/string.html)

In [5]:
# create 
s = str(42) # convert another data type into a string 
s = 'I like you'

# single or double quotation marks could be used
print('data science')
print("data science")

# examine a string 
print(s[0]) # returns 'I'
len(s) # returns 10

# string slicing like lists 
print(s[:6]) # returns 'I like' 
print(s[7:]) # returns 'you' 
print(s[-1]) # returns 'u'

# basic string methods (does not modify the original string)
print(s.lower()) # returns 'i like you' 
print(s.upper()) # returns 'I LIKE YOU' 
print(s.startswith('I')) # returns True 
print(s.endswith('you')) # returns True 
print(s.isdigit()) # returns False (returns True if every character in the string is a digit)
print(s.find('like')) # returns index of first occurrence (2), but doesn't support regex 
print(s.find('hate')) # returns -1 since not found 
print(s.replace('like','love')) # replaces all instances of 'like' with 'love'

# split a string into a list of substrings separated by a delimiter 
print(s.split(' ')) # returns ['I','like','you'] 
print(s.split()) # same thing 
s2 = 'a, an, the' 
print(s2.split(',')) # returns ['a',' an',' the']

# join a list of strings into one string using a delimiter 
stooges = ['larry','curly','moe'] 
print(' '.join(stooges)) # returns 'larry curly moe'

# concatenate strings 
s3 = 'The meaning of life is' 
s4 = '42' 
print(s3 + ' ' + s4) # returns 'The meaning of life is 42' 
print(s3 + ' ' + str(42)) # same thing

# remove whitespace from start and end of a string 
s5 = ' ham and cheese ' 
print(s5.strip()) # returns 'ham and cheese'

# string substitutions: all of these return 'raining cats and dogs' 
print('raining %s and %s' % ('cats','dogs')) # old way 
print('raining {} and {}'.format('cats','dogs')) # new way 
print('raining {arg1} and {arg2}'.format(arg1='cats',arg2='dogs')) # named arguments

print('first line\nsecond line')
print(r'first line\nfirst line')

data science
data science
I
I like
you
u
i like you
I LIKE YOU
True
True
False
2
-1
I love you
['I', 'like', 'you']
['I', 'like', 'you']
['a', ' an', ' the']
larry curly moe
The meaning of life is 42
The meaning of life is 42
ham and cheese
raining cats and dogs
raining cats and dogs
raining cats and dogs
first line
second line
first line\nfirst line


### [Exceptions](https://docs.python.org/3/tutorial/errors.html)

When something goes wrong, Python raises an exception. Unhandled, these will cause your program to crash. You can handle them using try and except:

In [3]:
try:
    print (0/0)
except ZeroDivisionError:
    print ("cannot divide by zero")

cannot divide by zero


In [4]:
dct = dict(a=[1, 2], b=[4, 5])
key = 'c'
try:
    dct[key]
except KeyError:  # catch only the missing-key error, not every exception
    print("Key %s is missing. Add it with empty value" % key)
    dct[key] = []
print(dct)


Key c is missing. Add it with empty value
{'a': [1, 2], 'b': [4, 5], 'c': []}


### [Lists](https://docs.python.org/3/tutorial/datastructures.html)

In [None]:
list_ex = [1,2,3]

### [Tuples](https://docs.python.org/3/tutorial/datastructures.html)

In [None]:
tuple_ex = (4, 5, 6)

### [Dictionaries](https://docs.python.org/3/tutorial/datastructures.html)

In [5]:
dictionary_ex = {'a' : 7, 'b' : 8, 'c' : 9}

#### [defaultdict](https://docs.python.org/3/library/collections.html)
(from Data Science from Scratch)
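
A minimal sketch of what `defaultdict` is typically used for, assuming a made-up word list: counting items without first checking whether a key exists.

```python
from collections import defaultdict

words = ["data", "science", "data", "notebook"]  # illustrative data

# int() returns 0, so a missing key starts at 0 instead of raising KeyError
word_counts = defaultdict(int)
for word in words:
    word_counts[word] += 1

print(dict(word_counts))
```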

#### [Counter](https://docs.python.org/3/library/collections.html)
(from Data Science from Scratch)
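
As a short sketch, again with a made-up word list, `Counter` turns a sequence into a mapping from each item to its count and can report the most frequent items.

```python
from collections import Counter

words = ["data", "science", "data", "notebook"]  # illustrative data

counts = Counter(words)      # maps each word to how often it appears
top = counts.most_common(1)  # list of (item, count) pairs, most frequent first
print(top)
```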

### [Sets](https://docs.python.org/3/tutorial/datastructures.html)

A set represents a collection of distinct elements. Sets are unordered, iterable, and mutable, and they can contain elements of different data types as long as each element is unique and hashable (strings, numbers, or tuples).

In [6]:
set_ex = {'d', 10, ('d', 10)}
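
The properties listed above can be sketched with a few common set operations. The variable names and values here are made up for illustration.

```python
languages = {"python", "r", "sql"}
languages.add("python")  # already present, so the set does not change
languages.add("julia")   # sets are mutable, so new elements can be added

tools = {"sql", "excel"}
common = languages & tools          # intersection: elements in both sets
combined = languages | tools        # union: elements in either set
only_languages = languages - tools  # difference: in languages but not in tools

unique = set([1, 1, 2, 3, 3])       # duplicates are dropped
print(len(languages), common, unique)
```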

### [Control Flow](https://docs.python.org/3/tutorial/controlflow.html)

#### [Conditional statements](https://docs.python.org/3/tutorial/controlflow.html)

In [7]:
x=3
if x > 0: 
    print('positive') 
elif x == 0: 
    print('zero') 
else: 
    print('negative')

positive


#### [Loops](https://docs.python.org/3/tutorial/controlflow.html)

In [9]:
# while loops
x = 0 
while x < 10:
    print (x, "is less than 10")
    x += 1

# for loops
for x in range(10):    
    print (x, "is < 10")

0 is less than 10
1 is less than 10
2 is less than 10
3 is less than 10
4 is less than 10
5 is less than 10
6 is less than 10
7 is less than 10
8 is less than 10
9 is less than 10
0 is < 10
1 is < 10
2 is < 10
3 is < 10
4 is < 10
5 is < 10
6 is < 10
7 is < 10
8 is < 10
9 is < 10


### [Sorting](https://docs.python.org/3/howto/sorting.html)

The built-in `sorted` function returns a new sorted list, while the `sort` method sorts a Python list in place.

In [None]:
# sort the list by absolute value from largest to smallest 
x = sorted([-4,1,-2,3], key=abs, reverse=True)  # is [-4,3,-2,1]

### [List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
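
A minimal sketch of list comprehensions, which build a new list by transforming or filtering the items of another iterable. The variable names are illustrative only.

```python
numbers = range(5)

squares = [x * x for x in numbers]                     # transform every item
even_squares = [x * x for x in numbers if x % 2 == 0]  # keep only even items
pairs = [(x, y) for x in range(2) for y in range(2)]   # nested loops

print(squares, even_squares, pairs)
```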

### [Generators and Iterators](https://docs.python.org/3/howto/functional.html#iterators)
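
As a sketch, a generator produces its values lazily, one at a time, instead of building a whole list in memory. The function `count_up_to` is a hypothetical example, not taken from the linked guide.

```python
def count_up_to(n):
    """Yield 0, 1, ..., n - 1 one value at a time."""
    i = 0
    while i < n:
        yield i  # execution pauses here until the next value is requested
        i += 1


gen = count_up_to(3)
first = next(gen)  # ask the generator for one value
rest = list(gen)   # consume the values that remain

# a generator expression looks like a list comprehension with parentheses
total = sum(x * x for x in range(4))
print(first, rest, total)
```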

### [Randomness](https://docs.python.org/3/library/random.html)

In data science, we often need to generate random numbers. Python's random module produces pseudorandom numbers.

In [10]:
import random
random.random()

0.34460870031142155
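
Because the numbers are pseudorandom, setting a seed makes the results reproducible. A minimal sketch, with an arbitrarily chosen seed value:

```python
import random

random.seed(10)  # the seed value 10 is arbitrary
first_draw = random.random()

random.seed(10)  # resetting the same seed restarts the same sequence
second_draw = random.random()

same = first_draw == second_draw

die_roll = random.randrange(1, 7)     # random integer from 1 to 6
pick = random.choice(["a", "b", "c"]) # random element of a list
print(same, die_roll, pick)
```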

### [Regular Expression](https://docs.python.org/3/library/re.html)

Regular expressions provide a way of searching text. In Python they are available through the re module.

In [12]:
import re
print (all([                                # all of these are true, because    
    not re.match("a", "cat"),              # * 'cat' doesn't start with 'a'    
    re.search("a", "cat"),                 # * 'cat' has an 'a' in it    
    not re.search("c", "dog"),             # * 'dog' doesn't have a 'c' in it    
    3 == len(re.split("[ab]", "carbs")),   # * split on a or b to ['c','r','s']    
    "R-D-" == re.sub("[0-9]", "-", "R2D2") # * replace digits with dashes    
    ]))  # prints True 

True


### [Object-Oriented Programming](https://docs.python.org/3/tutorial/classes.html)

In [None]:
class HospitalStaff:

    kind = 'doctor'             # class variable shared by all instances

    def __init__(self, name):
        self.name = name       # instance variable unique to each instance
        self.department = []   # creates a new empty list for each doctor

    def add_department(self, department):
        self.department.append(department)

'''
>>> l = HospitalStaff("Lyn")
>>> b = HospitalStaff("Bryan")
>>> l.kind
'doctor'
>>> b.kind
'doctor'
>>> l.name
'Lyn'
>>> b.name
'Bryan'
>>> l.add_department("Toxicology")
>>> b.add_department("Infectious Disease")
>>> l.department
['Toxicology']
>>> b.department
['Infectious Disease']'''

### Functional Tools

#### [enumerate](https://docs.python.org/3/library/functions.html#enumerate)
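
A short sketch of `enumerate`, which yields each item together with its index so no manual counter is needed. The names are illustrative.

```python
names = ["Lyn", "Bryan"]

labeled = []
for i, name in enumerate(names):  # i is the index, name is the item
    labeled.append(f"{i}: {name}")

print(labeled)
```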

#### [zip](https://docs.python.org/3/library/functions.html#zip) and Argument Unpacking
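
As a sketch with made-up lists, `zip` pairs up the items of several iterables, and the `*` operator unpacks a list into separate arguments, which can be used to "unzip" the pairs again.

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

pairs = list(zip(letters, numbers))  # pair items position by position

# *pairs passes each pair as a separate argument, which reverses the zip
unzipped_letters, unzipped_numbers = zip(*pairs)
print(pairs, unzipped_letters, unzipped_numbers)
```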

#### [args and kwargs](https://book.pythontips.com/en/latest/args_and_kwargs.html)
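
A minimal sketch, using the hypothetical functions `describe` and `add`: `*args` collects extra positional arguments into a tuple and `**kwargs` collects extra keyword arguments into a dictionary. The same symbols can also unpack a list or dictionary when calling a function.

```python
def describe(*args, **kwargs):
    """Return the positional and keyword arguments that were passed in."""
    return args, kwargs


def add(a, b):
    return a + b


positional, named = describe(1, 2, unit="cm")

total = add(*[1, 2])                # the list is unpacked into a and b
total_kw = add(**{"a": 1, "b": 2})  # the dict is unpacked into keywords
print(positional, named, total, total_kw)
```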

### System programming

#### [Operating system interfaces (os)](https://docs.python.org/3/library/os.html)
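
A few commonly used `os` functions, as a rough sketch. The folder name `notebook_demo` is made up for illustration, and nothing is created on disk.

```python
import os
import tempfile

cwd = os.getcwd()  # current working directory

# build a path in a portable way (the folder name is hypothetical)
path = os.path.join(tempfile.gettempdir(), "notebook_demo", "myfile.txt")

folder, name = os.path.split(path)  # separate the directory from the file name
stem, ext = os.path.splitext(name)  # separate the file name from its extension
exists = os.path.exists(path)       # check whether the path exists

print(cwd, folder, name, stem, ext, exists)
```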

#### File input/output

In [None]:
import os
import tempfile

mytmpdir = tempfile.gettempdir()  # assumption: any writable directory works here
filename = os.path.join(mytmpdir, "myfile.txt")
print(filename)

# Write 
lines = ["Dans python tout est bon", "Enfin, presque"]

## write line by line
fd = open(filename, "w")
fd.write(lines[0] + "\n")
fd.write(lines[1] + "\n")
fd.close()

## use a context manager to automatically close your file 
with open(filename, 'w') as f: 
    for line in lines: 
        f.write(line + '\n')

# Read 
## read one line at a time (entire file does not have to fit into memory) 
f = open(filename, "r") 
f.readline() # one string per line (including newlines) 
f.readline() # next line 
f.close()

## read the whole file at once, return a list of lines 
f = open(filename, 'r') 
f.readlines() # one list, each line is one string 
f.close()

## use list comprehension to duplicate readlines without reading entire file at once 
f = open(filename, 'r') 
[line for line in f] 
f.close()

## use a context manager to automatically close your file 
with open(filename, 'r') as f: 
    lines = [line for line in f]
