![data-x](https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png) 

## Introduction to Data-X
Mostly basics about Anaconda, Git, Python, and Jupyter Notebooks

#### Author: Alexander Fred Ojala

---


# Useful Links
1. Managing conda environments:
    - https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
2. GitHub:
    - https://readwrite.com/2013/09/30/understanding-github-a-journey-for-beginners-part-1/
    - https://readwrite.com/2013/10/02/github-for-beginners-part-2/
3. Learning Python (resources):
    - https://www.datacamp.com/
    - [Python Bootcamp](https://bids.berkeley.edu/news/python-boot-camp-fall-2016-training-videos-available-online)
4. Datahub: http://datahub.berkeley.edu/ (to run notebooks in the cloud)
5. Google Colab: https://colab.research.google.com (also running notebooks in the cloud)
6. Data-X website resources: https://data-x.blog
7. Book: [Hands-On Machine Learning with Scikit-Learn and TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_1?ie=UTF8&qid=1516300239&sr=8-1&keywords=hands+on+machine+learning+with+scikitlearn+and+tensorflow)

# Introduction to Jupyter Notebooks

From the [Project Jupyter Website](https://jupyter.org/):

* *__Project Jupyter__ exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. Collaborative, Reproducible.*

* *__The Jupyter Notebook__ is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.*

# Notebooks contain 2 cell types: Markdown & Code

###### Markdown cells

Where you write text.

Or, equations in LaTeX: $erf(x) = \frac{1}{\sqrt\pi}\int_{-x}^x e^{-t^2} dt$

Centered LaTeX matrices:

$$
\begin{bmatrix}
    x_{11} & x_{12} & x_{13} & \dots  & x_{1n} \\
    x_{21} & x_{22} & x_{23} & \dots  & x_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{d1} & x_{d2} & x_{d3} & \dots  & x_{dn}
\end{bmatrix} 
$$

<div class='alert alert-warning'>Bootstrap CSS and `HTML`</div>

Python (or any other programming language) Code
```python
# simple adder function
def adder(x,y):
    return x+y
```

# Header 1
## Header 2
### Header 3...

**bold**, *italic*

Divider

_____

* Bullet
* Lists


1. Enumerated
2. Lists

Useful images:
![](https://image.slidesharecdn.com/juan-rodriguez-ucberkeley-120331003737-phpapp02/95/juanrodriguezuc-berkeley-3-728.jpg?cb=1333154305)

<img src='https://image.slidesharecdn.com/juan-rodriguez-ucberkeley-120331003737-phpapp02/95/juanrodriguezuc-berkeley-3-728.jpg?cb=1333154305' width='200px'>

---

An internal (HTML) link to a section in the notebook:


## <a href='#bottom'>Link: Take me to the bottom of the notebook</a>

___

## **Find a lot of useful Markdown commands here:** 
### https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

___

# Code Cells
Code cells let you run Python commands interactively.

In [1]:
print('hello world!')
print('2nd row')

hello world!
2nd row


In [2]:
# Comment in a code cell

In [3]:
# Lines evaluated sequentially
# A cell displays output of last line
2+2
3+3
5+5

10

In [4]:
# Stuck in an infinite loop
while True:
    continue

KeyboardInterrupt: 

In [None]:
# Cells evaluated sequentially

In [5]:
tmp_str = 'this is now stored in memory'
print(tmp_str)

this is now stored in memory


In [6]:
print("Let's Start Over")

Let's Start Over


In [7]:
print(tmp_str)

this is now stored in memory


## Jupyter / Ipython Magic

In [8]:
# Magic commands (only for Jupyter and IPython, won't work in script)
%ls

[1m[36marchive[m[m/                     python-jupyter-basics.ipynb
data-x-intro-lec.pdf         [1m[36mresources[m[m/
guide-to-resources.pdf


In [9]:
# Time several runs of same operation
%timeit [i for i in range(1000)];

27.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [10]:
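# Time a single run of an operation
# Note: %time on its own line times an empty statement; put the code
# on the same line as %time (or start the cell with %%time) to time it
%time 
[x for x in range(1000)];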

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.1 µs
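
Magic commands are not available in plain Python scripts, so a rough stand-in for `%timeit` can be sketched with the standard-library `timeit` module. The loop count of 1000 is an arbitrary choice for illustration, and the measured time will differ from machine to machine.

```python
import timeit

# Run the statement 1000 times and measure the total elapsed time in seconds
n_loops = 1000
total_seconds = timeit.timeit("[i for i in range(1000)]", number=n_loops)

# Convert to microseconds per loop, similar to what %timeit reports
per_loop_us = total_seconds / n_loops * 1e6
print(f"{per_loop_us:.1f} µs per loop")
```

Unlike `%timeit`, this sketch does not repeat the measurement several times or report a standard deviation.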


In [11]:
%ls resources/

print_hw3.py     random.txt       sample_data.csv  spam.png


In [12]:
# %load resources/print_hw3.py
def print_hw(x):
    for i in range(int(x)):
        print(str(i)+' hello python script!')

print_hw(3)


0 hello python script!
1 hello python script!
2 hello python script!


In [13]:
%matplotlib inline

In [14]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%

In [15]:
?%alias

Docstring:
Define an alias for a system command.

'%alias alias_name cmd' defines 'alias_name' as an alias for 'cmd'

Then, typing 'alias_name params' will execute the system command 'cmd
params' (from your underlying operating system).

Aliases have lower precedence than magic functions and Python normal
variables, so if 'foo' is both a Python variable and an alias, the
alias can not be executed until 'del foo' removes the Python variable.

You can use the %l specifier in an alias definition to represent the
whole line when the alias is called.  For example::

  In [2]: alias bracket echo "Input in brackets: <%l>"
  In [3]: bracket hello world
  Input in brackets: <hello world>

You can also define aliases with parameters using %s specifiers (one
per parameter)::

  In [1]: alias parts echo first %s second %s
  In [2]: %parts A B
  first A second B
  In [3]: %parts A
  Incorrect number of arguments: 2 expected.
  parts is an alias to: 'echo first %s second %s'

Note that %l

In [16]:
?str

Init signature: str(self, /, *args, **kwargs)
Docstring:     
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.
Type:           type
Subclasses:     DeferredConfigString, _rstr, LSString, include, ColorDepth, Keys, InputMode, CompleteStyle, SortKey, str_, ...


## Terminal / Command Prompt commands

In [17]:
# Shell commands
!cat resources/random.txt

data-x is the best class at uc berkeley!
//anonymous

line 4


In [18]:
!ls # on macOS / Linux

[1m[36marchive[m[m                     python-jupyter-basics.ipynb
data-x-intro-lec.pdf        [1m[36mresources[m[m
guide-to-resources.pdf


In [19]:
!dir # on Windows

zsh:1: command not found: dir


In [20]:
# show first lines of a data file
!head -n 1 resources/sample_data.csv

1,"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8


In [21]:
# count lines, words, and bytes of a data file (wc -l counts lines only)
!wc resources/sample_data.csv

    1000    8606  121306 resources/sample_data.csv
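
Shell commands differ between operating systems (`ls`, `head` and `wc` on Unix, `dir` on Windows), so the same steps can also be done in Python itself. The sketch below is only an illustration: it writes a small temporary file with made-up contents (`example.csv` is a hypothetical stand-in for a data file), then lists the directory, reads the first line, and counts the lines.

```python
import os
import tempfile

# Create a small stand-in data file in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.csv")
with open(path, "w") as f:
    f.write("id,name\n1,alpha\n2,beta\n")

# Equivalent of `ls` / `dir`
print(os.listdir(tmp_dir))

# Equivalent of `head -n 1` and `wc -l`
with open(path) as f:
    lines = f.readlines()
first_line = lines[0].rstrip("\n")
line_count = len(lines)
print(first_line, line_count)
```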


# Useful tips (Keyboard shortcuts etc):
1. Enter selection (command) mode / cell (edit) mode (Esc / Return)
2. Insert cells (press A or B in selection mode)
3. Delete / Cut cells (press X in selection mode)
4. Select several cells (hold Shift and use the arrow keys in selection mode)
5. Merge cells (Select, then Shift+M)

# Printing to pdf 
### (USEFUL FOR HOMEWORKS)
**Easiest**: File -> Print Preview. 
Then save that page as a PDF (Ctrl + P, Save as PDF usually works).

**Pro:** Install a LaTeX compiler. Then: File -> Download As -> PDF.

# Quick Review of Python Topics

### Check what Python distribution you are running

In [22]:
!which python #works on unix system, maybe not Windows

/Users/afo/miniconda3/envs/data-x/bin/python


In [23]:
# Check that it is Python 3
import sys # import built in package
print(sys.version)

3.7.11 (default, Jul 27 2021, 07:03:16) 
[Clang 10.0.0 ]


## Python as a calculator

In [24]:
# Addition
2.1 + 2

4.1

In [25]:
# Multiplication
10*10.0

100.0

In [26]:
# Floor division
7//3

2

In [27]:
# Floating-point (true) division; in Python 2, 7/3 performed floor division and returned 2
7/3

2.3333333333333335

In [28]:
type(2)

int

In [29]:
type(2.0)

float

In [30]:
a = 3
b = 5
print (b**a) # ** is exponentiation

125


In [31]:
print (b%a)  # modulus operator = remainder

2
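
Floor division and the remainder are linked: for integers `n` and `d`, `n == (n // d) * d + n % d`. The short sketch below checks this with the built-in `divmod`, using example values chosen only for illustration, and shows how negative numbers behave.

```python
n, d = 7, 3

# divmod returns the floor quotient and the remainder in one call
q, r = divmod(n, d)
print(q, r)              # 2 1

# The two results always recombine to the original number
print(q * d + r == n)    # True

# With a negative operand, // rounds toward negative infinity,
# and the remainder takes the sign of the divisor
print(-7 // 3, -7 % 3)   # -3 2
```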


In [32]:
type(5) == type(5.0)

False

In [33]:
# boolean checks
a = True
b = False
print (a and b)

False


In [34]:
# conditional programming
if 5 == 5:
    print('correct!')
else:
    print('what??')

correct!


In [35]:
print (isinstance(1,int))

True


## String slicing and indices
<img src="resources/spam.png" width="480">

In [36]:
# Strings and slicing
x = "abcdefghijklmnopqrstuvwxyz"

In [37]:
print(x)

abcdefghijklmnopqrstuvwxyz


In [38]:
print(x[1]) # zero indexed

b


In [39]:
print (type(x))

<class 'str'>


In [40]:
print (len(x))

26


In [41]:
print(x)

abcdefghijklmnopqrstuvwxyz


In [42]:
print (x[1:6:2]) # start:stop:step

bdf


In [43]:
print (x[::3])

adgjmpsvy


In [44]:
print (x[::-1])

zyxwvutsrqponmlkjihgfedcba
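
Negative indices count from the end of the string, and any of `start`, `stop` and `step` can be omitted. A few more slices, as a small sketch on the same alphabet string:

```python
x = "abcdefghijklmnopqrstuvwxyz"

print(x[-1])      # last character: z
print(x[:3])      # first three characters: abc
print(x[-3:])     # last three characters: xyz
print(x[5:2:-1])  # step backwards from index 5 down to index 3: fed
```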


### Manipulating text

In [45]:
# Triple quotes are useful for multiple line strings
y = '''The quick brown 
fox jumped over 
the lazy dog.'''
print (y)

The quick brown 
fox jumped over 
the lazy dog.


### String operators and methods

In [46]:
# tokenize by space
words = y.split(' ')
print (words)

['The', 'quick', 'brown', '\nfox', 'jumped', 'over', '\nthe', 'lazy', 'dog.']


In [47]:
# remove the line break (newline) character
[w.replace('\n','') for w in words]

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
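
As an alternative sketch, calling `split()` with no argument splits on any run of whitespace, including newlines, so tokenizing and newline removal happen in one step. The names `text` and `tokens` are chosen here only for illustration.

```python
# Same sentence as above, written with explicit newline characters
text = 'The quick brown \nfox jumped over \nthe lazy dog.'

# No argument: split on spaces, tabs and newlines alike
tokens = text.split()
print(tokens)

# join() is the inverse operation: glue the tokens back with single spaces
one_line = ' '.join(tokens)
print(one_line)
```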

<div class='alert alert-success'>TAB COMPLETION TIPS</div>

In [48]:
words.append('last words')

In [49]:
import pandas as pd

In [50]:
?pd.read_excel

Signature:
pd.read_excel(
    io,
    sheet_name=0,
    header=0,
    names=None,
    index_col=None,
    usecols=None,
    squeeze=False,
    dtype: 'DtypeArg | None' = None,
    engine=None,
    converters=None,
    true_values=None,

In [51]:
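# Place the cursor after the dot and press Tab to list the available methods;
# running the cell as-is raises the SyntaxError shown below
y.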

SyntaxError: invalid syntax (927355991.py, line 1)

In [None]:
str()

# Data Structures

## **Tuple:** Sequence of Python objects. Immutable.

In [52]:
t = ('a','b', 3)
print (t) 
print (type (t))
t[1]

('a', 'b', 3)
<class 'tuple'>


'b'

In [53]:
t[1] = 2 #error

TypeError: 'tuple' object does not support item assignment
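
Because tuples are immutable, the usual pattern is to unpack them or to build a new tuple rather than change one in place. A minimal sketch, in which the names `first`, `second`, `third` and `t2` are illustrative:

```python
t = ('a', 'b', 3)

# Tuple unpacking assigns each element to its own name
first, second, third = t
print(first, third)

# Assignment to an index raises TypeError; catch it to keep running
try:
    t[1] = 2
except TypeError as err:
    error_message = str(err)
    print("Cannot modify a tuple:", error_message)

# To "change" a tuple, build a new one instead
t2 = t[:1] + (2,) + t[2:]
print(t2)   # ('a', 2, 3)
```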

## **List:** Sequence of Python objects. Mutable

In [54]:
y = list() # create empty list
type(y)

list

In [55]:
type([])

list

In [56]:
# Append to list
y.append('hello')
y.append('world')
print(y)

['hello', 'world']


In [57]:
y.pop(1)

'world'

In [58]:
print(y)

['hello']


In [59]:
# List addition (merge)
y + ['data-x']

['hello', 'data-x']

In [60]:
# List multiplication
y*4

['hello', 'hello', 'hello', 'hello']

In [61]:
# list of numbers
even_nbrs = list(range(0,20,2)) # range has lazy evaluation
print (even_nbrs)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


In [62]:
# supports objects of different data types
z = [1,4,'c',4, 2, 6]
print (z)

[1, 4, 'c', 4, 2, 6]


In [63]:
# list length (number of elements)
print(len(z))

6


In [64]:
# it's easy to know if an element is in a list
print ('c' in z)

True


In [65]:
print (z[2])  # print element at index 2

c


In [66]:
# traverse / loop over all elements in a list
for i in z:
    print (i)

1
4
c
4
2
6


In [67]:
# lists can be sorted, 
# but not with different data types
z.sort()

TypeError: '<' not supported between instances of 'str' and 'int'

In [68]:
#z.sort() # doesn't work
z.pop(2)

'c'

In [69]:
z

[1, 4, 4, 2, 6]

In [70]:
z.sort() # now it works!
z

[1, 2, 4, 4, 6]
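
Removing the string is one way around the `TypeError`. Two other options, sketched here under the assumption that either filtering or comparing by string form is acceptable for your data, use the built-in `sorted()`, which returns a new list and leaves the original alone:

```python
z = [1, 4, 'c', 4, 2, 6]

# Option 1: keep only the integers, then sort them
numbers_only = sorted(item for item in z if isinstance(item, int))
print(numbers_only)   # [1, 2, 4, 4, 6]

# Option 2 (illustrative): compare every element by its string form
as_text = sorted(z, key=str)
print(as_text)        # [1, 2, 4, 4, 6, 'c']

# The original list is unchanged
print(z)              # [1, 4, 'c', 4, 2, 6]
```

Note that sorting by string form orders multi-digit numbers lexicographically (for example `'10'` before `'2'`), so it suits display more than numeric work.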

In [71]:
print (z.count(4))  # how many times is there a 4

2


In [72]:
# loop examples
for x in z:
    print ("this item is ", x)

this item is  1
this item is  2
this item is  4
this item is  4
this item is  6


In [73]:
# print with index
for i,x in enumerate(z):
    print ("item at index ", i," is ",  x )

item at index  0  is  1
item at index  1  is  2
item at index  2  is  4
item at index  3  is  4
item at index  4  is  6


In [74]:
# print all even numbers up to an integer
for i in range(0,10,2):
    print (i)

0
2
4
6
8


In [75]:
# list comprehension is like f(x) for x as an element of Set X
# S = {x² : x in {0 ... 9}}
S = [x**2 for x in range(10)]
print (S)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [76]:
# All even elements from S
# M = {x | x in S and x even}
M = [x for x in S if x % 2 == 0]
print (M)

[0, 4, 16, 36, 64]
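
The same comprehension syntax also builds dictionaries and sets, and a generator expression produces values lazily without storing a list. A short sketch with illustrative variable names:

```python
# Dictionary comprehension: map each number to its square
squares = {x: x**2 for x in range(5)}
print(squares)      # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

# Set comprehension: duplicates collapse automatically
remainders = {x % 3 for x in range(10)}
print(remainders)

# Generator expression: values are produced one at a time, no list is built
total = sum(x**2 for x in range(10))
print(total)        # 285
```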


In [77]:
# Matrix representation with Lists
print([[1,2,3],[4,5,6]]) # 2 x 3 matrix

[[1, 2, 3], [4, 5, 6]]
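
With a list of lists, an element is reached by indexing the row first and then the column. The sketch below also shows one common way to transpose such a matrix with `zip`. The names `matrix` and `transposed` are illustrative.

```python
matrix = [[1, 2, 3],
          [4, 5, 6]]   # 2 x 3 matrix

# Row 1, column 2 (both zero-indexed)
print(matrix[1][2])    # 6

# zip(*matrix) pairs up the i-th elements of every row, i.e. the columns
transposed = [list(col) for col in zip(*matrix)]
print(transposed)      # [[1, 4], [2, 5], [3, 6]]
```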


# Sets (collection of unique elements)

In [78]:
# a set is not ordered
a = set([1, 2, 3, 3, 3, 4, 5,'a'])
print (a)

{1, 2, 3, 4, 5, 'a'}


In [79]:
b = set('abaacdef')
print (b) # not ordered

{'b', 'e', 'f', 'd', 'c', 'a'}


In [80]:
print (a|b) # union of a and b

{1, 2, 3, 4, 5, 'b', 'e', 'f', 'd', 'c', 'a'}


In [81]:
print(a&b) # intersection of a and b

{'a'}


In [82]:
a.remove(5)
print (a) # the element 5 has been removed

{1, 2, 3, 4, 'a'}
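
Besides union and intersection, sets support difference (`-`) and symmetric difference (`^`). The sketch below also contrasts `remove()`, which raises `KeyError` for a missing element, with `discard()`, which does not. The value `99` is an arbitrary element that is not in the set.

```python
a = {1, 2, 3, 4, 5, 'a'}
b = set('abaacdef')

only_in_a = a - b        # elements of a that are not in b
in_exactly_one = a ^ b   # elements in a or b, but not in both
print(only_in_a)
print(in_exactly_one)

# discard() silently ignores a missing element
a.discard(99)

# remove() raises KeyError for a missing element
try:
    a.remove(99)
    removed = True
except KeyError:
    removed = False
    print("99 is not in the set")
```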


# Dictionaries: Key Value pairs
Almost like JSON data

In [83]:
# Dictionaries, many ways to create them
# First way to create a dictionary is just to assign it
D1 = {'f1': 10, 'f2': 20, 'f3':25}              

In [84]:
D1

{'f1': 10, 'f2': 20, 'f3': 25}

In [85]:
D1['f2']

20

In [86]:
# 2. creating a dictionary using the dict()
D2 = dict(f1=10, f2=20, f3 = 30)
print (D2['f3'])

30


In [87]:
# 3. Another way, start with empty dictionary
D3 = {}
D3['f1'] = 10
D3['f2'] = 20
print (D3['f1'])

10


In [88]:
# Dictionaries can be more complex, i.e. a dictionary of dictionaries or of tuples, etc.
D5 = {}
D5['a'] = D1
D5['b'] = D2
print (D5['a']['f3'])

25


In [89]:
D5

{'a': {'f1': 10, 'f2': 20, 'f3': 25}, 'b': {'f1': 10, 'f2': 20, 'f3': 30}}

In [90]:
# traversing by key
# keys must be immutable (hashable), e.g. numbers, strings, or tuples
for k in D1.keys():
    print (k)

f1
f2
f3


In [91]:
# traversing by values
for v in D1.values(): 
    print(v)

10
20
25


In [92]:
# traversing by key and value: each (key, value) pair is called an item
for k, v in D1.items():                # tuples with keys and values
    print (k,v)

f1 10
f2 20
f3 25
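
Since dictionaries resemble JSON data, a dictionary with string keys and simple values can be converted to and from a JSON string with the standard-library `json` module. The sketch below also shows `get()`, which returns a default instead of raising `KeyError`. The key `'f4'` is a deliberately missing key chosen for illustration.

```python
import json

D1 = {'f1': 10, 'f2': 20, 'f3': 25}

# get() returns a default value when the key is missing
print(D1.get('f4', 0))      # 0
print('f2' in D1)           # True

# Serialize to a JSON string and parse it back
as_json = json.dumps(D1)
print(as_json)              # {"f1": 10, "f2": 20, "f3": 25}
restored = json.loads(as_json)
print(restored == D1)       # True
```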


# User input

In [None]:
# input
# raw_input() was renamed to input() in Python v3.x
# The old input() is gone, but you can emulate it with eval(input())

print ("Input a number:")
s = input()  # returns a string
a = int(s)
print ("The number is ", a)

Input a number:
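
`int(s)` raises a `ValueError` when the typed text is not a whole number. One possible way to guard against that is a small helper like the one sketched below. The function name `parse_int` is hypothetical and not part of the original notebook.

```python
def parse_int(text):
    """Return text converted to int, or None if it is not a valid integer."""
    try:
        return int(text.strip())
    except ValueError:
        return None

print(parse_int("42"))     # 42
print(parse_int("abc"))    # None

# In a notebook this could be used as: a = parse_int(input())
```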


# Import packages

In [79]:
import numpy as np

In [80]:
np.subtract(3,1)

2
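
There are a few common import styles. The sketch below uses the standard-library `math` module (instead of NumPy) so that it runs without any installed packages. The alias `m` is an arbitrary choice.

```python
# Import the whole module and access names through it
import math
print(math.sqrt(16))    # 4.0

# Import the module under a shorter alias, as with `import numpy as np`
import math as m
print(m.floor(2.7))     # 2

# Import a single name directly into the current namespace
from math import sqrt
print(sqrt(9))          # 3.0
```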

# Functions

In [81]:
def adder(x,y):
    s = x+y
    return s

In [82]:
adder(2,3)

5
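
Functions can also take default values and keyword arguments, and a docstring documents what they do. The variant of `adder` below is a sketch for illustration. The default `y=1` is an assumption and not part of the original function.

```python
def adder(x, y=1):
    """Return the sum of x and y; y defaults to 1."""
    return x + y

print(adder(2, 3))       # 5
print(adder(2))          # 3, uses the default y=1
print(adder(y=10, x=5))  # 15, keyword arguments can come in any order
print(adder.__doc__)     # the docstring, also shown by ?adder in Jupyter
```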

# Classes

In [83]:
class Holiday:
    def __init__(self,holiday='Holidays'):
        self.base = 'Happy {}!'
        self.greeting = self.base.format(holiday)
    
    def greet(self):
        print(self.greeting)
        
easter = Holiday('Easter')
hanukkah = Holiday('Hanukkah')

In [84]:
easter.greeting

'Happy Easter!'

In [85]:
hanukkah.greet()

Happy Hanukkah!


In [86]:
# extend class

class Holiday_update(Holiday):
    
    def update_greeting(self, new_holiday):
        self.greeting = self.base.format(new_holiday)

In [87]:
hhg = Holiday_update('July 4th')

In [88]:
hhg.greet()

Happy July 4th!


In [89]:
hhg.update_greeting('Labor day / End of Burning Man')
hhg.greet()

Happy Labor day / End of Burning Man!
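
A subclass can also override `__init__` and call the parent's version through `super()`. The sketch below repeats the `Holiday` class so that it runs on its own. `LoudHoliday` is a hypothetical subclass added only for illustration.

```python
class Holiday:
    def __init__(self, holiday='Holidays'):
        self.base = 'Happy {}!'
        self.greeting = self.base.format(holiday)

    def greet(self):
        print(self.greeting)


class LoudHoliday(Holiday):
    """Hypothetical subclass that upper-cases the greeting."""

    def __init__(self, holiday='Holidays'):
        # Run the parent initializer first, then adjust the result
        super().__init__(holiday)
        self.greeting = self.greeting.upper()


loud = LoudHoliday('Easter')
loud.greet()                        # HAPPY EASTER!
print(isinstance(loud, Holiday))    # True
```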


<div id='bottom'></div>