## CSCI-467 Discussion (Week 1)

### Rules

1. Please read the instructions and problem prompts **carefully**.
2. This lab introduces some basic APIs of numpy, pandas and scikit-learn. The lab is to be done individually. You may talk to your classmates about general issues ("Remind me again: which API should I use to do a group-by operation on a data set?") but not about the specifics of how to do these exercises.
3. In a similar vein, you can ask the TA for help, but ask questions about **concepts**; do not ask the TA to help you debug your code. The TA is here to help, but not to do the work for you.
4. You are welcome to use the class resources and the Internet.
5. Play with variations. Solve one problem, then copy the code to a new cell and play around with it. Doing this is the single most important thing when learning programming.
6. This lab will not be graded, but the content is closely related to your future programming assignments, so take it seriously.
7. The content covered in the week 1 discussion is just a snapshot of the most basic concepts. **You need to keep studying Git, GitHub, Pandas, Numpy and Scikit-Learn in order to finish your programming assignments successfully.**
8. Have fun!

### Set Up the Development Environment

There are many ways to set up the environment, but I recommend a simple one: Anaconda, a pre-built Python environment bundled with many useful packages.

**To download Anaconda,** go to https://www.anaconda.com/distribution/. Download the correct version for your operating system and follow the installer step by step.

Then, **configure your PATH environment variable** so that the conda command works. The following command is an easy way to test whether your configuration is correct. If it is, you will see something like the sample output.

> **command:**
>
> conda --version
>
> **sample output:**
>
> conda 4.6.12

**Finally, download this Jupyter notebook file,** then change the working directory in your terminal to the file's location, and type the following command to open the Jupyter notebook and finish the lab.

> **command:** 
> jupyter notebook

### Pandas

#### The read_csv() Method

First, read the documentation for the *read_csv()* method in Pandas (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). Then read the data from the file Salaries.csv into a dataframe, using the playerID column of the CSV file as the index and the first row as the header. Also, skip the second row when reading the file.

In [12]:
import pandas as pd

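# header=0: the first row holds the column names.
# index_col="playerID": use the playerID column as the row index.
# skiprows=[1]: skip the second row of the file (line numbers are 0-based).
df = pd.read_csv("Salaries.csv", header=0, index_col="playerID", skiprows=[1])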

In [13]:
df.head()

| playerID | yearID | teamID | lgID | salary |
|---|---|---|---|---|
| bedrost01 | 1985 | ATL | NL | 550000 |
| benedbr01 | 1985 | ATL | NL | 545000 |
| campri01 | 1985 | ATL | NL | 633333 |
| ceronri01 | 1985 | ATL | NL | 625000 |
| chambch01 | 1985 | ATL | NL | 800000 |
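
To experiment with these parameters without the real file, the sketch below parses a small in-memory CSV. The `csv_text` and `sample` names are hypothetical, and the three data rows are copied from the output above purely for illustration; `header`, `index_col`, and `skiprows` are the `read_csv` parameters being demonstrated.

```python
import io

import pandas as pd

# Hypothetical stand-in for Salaries.csv: a header line plus three data rows.
csv_text = """yearID,teamID,lgID,playerID,salary
1985,ATL,NL,bedrost01,550000
1985,ATL,NL,benedbr01,545000
1985,ATL,NL,campri01,633333
"""

sample = pd.read_csv(
    io.StringIO(csv_text),  # read_csv accepts file-like objects as well as paths
    header=0,               # the first line holds the column names
    index_col="playerID",   # use playerID as the row index
    skiprows=[1],           # skip the second line of the file (0-based line 1)
)

print(sample.index.tolist())  # ['benedbr01', 'campri01']
```

Because `skiprows=[1]` counts lines of the file rather than data rows, the header line (line 0) is kept and the first data row is dropped.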


#### Indexing and Selecting Data

Select the players (rows are indexed by playerID) who are registered with either ATL or HOU and whose salary is higher than one million.

In [25]:
df[((df['teamID'] == 'ATL') | (df['teamID'] == 'HOU')) & (df['salary'] > 1000000)]

| playerID | yearID | teamID | lgID | salary |
|---|---|---|---|---|
| hornebo01 | 1985 | ATL | NL | 1500000 |
| murphda05 | 1985 | ATL | NL | 1625000 |
| suttebr01 | 1985 | ATL | NL | 1354167 |
| ryanno01 | 1985 | HOU | NL | 1350000 |
| hornebo01 | 1986 | ATL | NL | 1800000 |
| murphda05 | 1986 | ATL | NL | 1825000 |
| suttebr01 | 1986 | ATL | NL | 1729167 |
| ryanno01 | 1986 | HOU | NL | 1125000 |
| griffke01 | 1987 | ATL | NL | 1150000 |
| murphda05 | 1987 | ATL | NL | 1925000 |
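
One possible variation on the selection above, shown on a made-up mini-dataset (the `toy` dataframe and its values are hypothetical, not taken from Salaries.csv): `isin()` can replace several `==` comparisons joined with `|`, and `.index` returns only the player ids rather than whole rows.

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as the salary dataframe.
toy = pd.DataFrame(
    {
        "teamID": ["ATL", "HOU", "NYA", "ATL"],
        "salary": [1500000, 900000, 2000000, 550000],
    },
    index=pd.Index(["p1", "p2", "p3", "p4"], name="playerID"),
)

# isin() tests membership in a list of teams; & combines the two conditions.
mask = toy["teamID"].isin(["ATL", "HOU"]) & (toy["salary"] > 1_000_000)

# .index keeps only the ids of the matching rows.
player_ids = toy[mask].index.tolist()
print(player_ids)  # ['p1']
```

Each condition needs its own parentheses when combined with `&` or `|`, because those operators bind more tightly than comparisons such as `>`.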


#### The describe() Method

Calculate the standard deviation, first quartile, median, third quartile, mean, maximum, and minimum of the salaries in team ATL.

In [30]:
import numpy as np

In [33]:
teamATLSalaries = df[df['teamID'] == 'ATL']
teamATLSalaries = teamATLSalaries['salary']

first_quartile = np.percentile(teamATLSalaries, 25)
second_quartile = np.percentile(teamATLSalaries, 50)
third_quartile = np.percentile(teamATLSalaries, 75)
mean = teamATLSalaries.mean()
myMax = teamATLSalaries.max()
myMin = teamATLSalaries.min()

print(first_quartile, second_quartile, third_quartile, mean, myMax, myMin)

(300000.0, 600000.0, 2400000.0, 2207749.190960452, 16061802, 60000)
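
As a rough illustration of the method this section is named after, the sketch below calls `describe()` on a small made-up Series (the values are hypothetical, not the ATL salaries). It returns all of the requested statistics, including the standard deviation, in one call.

```python
import pandas as pd

# Made-up values for illustration only.
salaries = pd.Series([100, 200, 300, 400, 500], name="salary")

stats = salaries.describe()
print(stats.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

print(stats["mean"])  # 300.0
print(stats["50%"])   # 300.0 (the median)
print(stats["25%"], stats["75%"])  # 200.0 400.0
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas `np.std()` defaults to the population standard deviation (`ddof=0`), so the two can disagree on the same data.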


#### The iterrows() Method

Create a Python dictionary whose keys are the headers of the dataframe created in the read_csv() exercise and whose values are Python lists containing the data for each header. (Here, use the iterrows() method to iterate over each row of the dataframe and copy it into the dictionary. There is an easier way, though: learn how the to_dict() method works on your own later.)

In [43]:
# Start with one empty list per column header.
my_dict = {header: [] for header in df.columns}

# Append each row's values to the list of the matching column.
for index, row in df.iterrows():
    for header in df.columns:
        my_dict[header].append(row[header])

In [48]:
len(my_dict['teamID'])

25574
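
For comparison, here is a minimal sketch of the `to_dict()` shortcut mentioned in the exercise, using a hypothetical two-row dataframe named `toy`. With `orient="list"`, each column becomes one Python list.

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
toy = pd.DataFrame({"teamID": ["ATL", "HOU"], "salary": [550000, 1350000]})

# orient="list" maps each column name to a list of that column's values.
as_lists = toy.to_dict(orient="list")
print(as_lists)  # {'teamID': ['ATL', 'HOU'], 'salary': [550000, 1350000]}
```

Like the `iterrows()` version, this does not include the index, so call `reset_index()` first if the index values are needed too.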

#### Create Dataframe Using the Constructor

Read the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame and create a dataframe using pd.DataFrame from the dictionary created in the iterrows() exercise. Change the header to "a", "b", "c", ... at creation time.

In [60]:
newDF = pd.DataFrame({'a':my_dict['lgID'], 'b':my_dict['salary'], 'c':my_dict['yearID'], 'd':my_dict['teamID']})

In [61]:
newDF.head()

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | NL | 550000 | 1985 | ATL |
| 1 | NL | 545000 | 1985 | ATL |
| 2 | NL | 633333 | 1985 | ATL |
| 3 | NL | 625000 | 1985 | ATL |
| 4 | NL | 800000 | 1985 | ATL |
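
If the dictionary has many keys, typing each letter by hand gets tedious. One possible approach, sketched below with a hypothetical `data` dictionary, pairs the value lists with lowercase letters before calling the constructor.

```python
import string

import pandas as pd

# Hypothetical dictionary shaped like the one from the iterrows() exercise.
data = {"lgID": ["NL", "NL"], "salary": [550000, 545000], "yearID": [1985, 1985]}

# zip() pairs 'a', 'b', 'c', ... with the value lists in insertion order.
renamed = dict(zip(string.ascii_lowercase, data.values()))

frame = pd.DataFrame(renamed)
print(frame.columns.tolist())  # ['a', 'b', 'c']
```

This relies on dictionaries preserving insertion order (Python 3.7+) and works only for up to 26 columns.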


### Numpy

Quick start: https://www.numpy.org/devdocs/user/quickstart.html

Numpy axes explanation: https://www.sharpsightlabs.com/blog/numpy-axes-explained/

#### The np.array Method

Example 1:

```python
ls = [1, 2, 3]
arr = np.array(ls)
```

Example 2:
```python
>>> np.array([[1, 2], [3, 4]])
array([[1, 2],
       [3, 4]])
```

Now, create a 2-dimensional Python list object, then convert it to a Numpy array object.

In [62]:
myList = [[1,2],[3,4],[5,6],[7,8],[9,0]]
npList = np.array(myList)

In [63]:
npList

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8],
       [9, 0]])

#### ndarray Objects' Attributes

Play with the **ndim, shape, size, dtype, itemsize and data** attributes.

Example:

```python
>>> arr = np.array([[1, 2], [3, 4]])
>>> arr.ndim
2
```

In [69]:
npList.data

<read-write buffer for 0x113137c10, size 80, offset 0 at 0x1135a8f70>
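
The short sketch below prints the remaining attributes for a small array. The dtype is set explicitly to `int64` here (an assumption made so that the results are the same on every platform); `nbytes` is an extra attribute not listed above.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)

print(arr.ndim)      # 2: number of axes
print(arr.shape)     # (3, 2): 3 rows, 2 columns
print(arr.size)      # 6: total number of elements
print(arr.dtype)     # int64: element type
print(arr.itemsize)  # 8: bytes per element
print(arr.nbytes)    # 48: size * itemsize
```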

#### Dimension of ndarray Objects

Play with the reshape() and flatten() methods.

Example:
```python
>>> arr = np.array([[1, 2], [3, 4]])
>>> arr.flatten()
array([1, 2, 3, 4])
```

In [76]:
npList.reshape([2,5])

array([[1, 2, 3, 4, 5],
       [6, 7, 8, 9, 0]])
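
Two details worth trying, sketched here on a small example array: passing `-1` to `reshape()` lets NumPy infer that dimension, and `flatten()` always returns a copy, so modifying the result leaves the original untouched.

```python
import numpy as np

arr = np.arange(6)         # array([0, 1, 2, 3, 4, 5])

grid = arr.reshape(2, -1)  # -1 is inferred as 3 because 6 / 2 = 3
print(grid.shape)          # (2, 3)

flat = grid.flatten()      # flatten() returns a new, independent 1-D array
flat[0] = 99
print(grid[0, 0])          # 0: the original array is unchanged
```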

#### The Slice Operation of ndarray Objects

Understand how the slice operation works for 1-D and 2-D arrays.

Example:

```python
>>> arr = np.array([[1, 2, 3], [3, 4, 6], [7, 8, 9]])
>>> arr[1:]
array([[3, 4, 6],
       [7, 8, 9]])
>>> arr[1:, 0:2]
array([[3, 4],
       [7, 8]])
```

In [81]:
npList[1:]

array([[3, 4],
       [5, 6],
       [7, 8],
       [9, 0]])

In [80]:
npList[1:,0:1]

array([[3],
       [5],
       [7],
       [9]])
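
The sketch below adds a 1-D case and shows why `npList[1:,0:1]` above stays two-dimensional: a slice keeps its axis, while a plain integer index drops it. The arrays `vec` and `mat` are made up for illustration.

```python
import numpy as np

# 1-D slicing follows the start:stop:step pattern of Python lists.
vec = np.array([10, 20, 30, 40, 50])
print(vec[1:4].tolist())  # [20, 30, 40]
print(vec[::2].tolist())  # [10, 30, 50]

# 2-D slicing takes one slice (or index) per axis.
mat = np.array([[1, 2], [3, 4], [5, 6]])
col_1d = mat[1:, 0]    # integer index on axis 1 drops that axis
col_2d = mat[1:, 0:1]  # slice on axis 1 keeps it
print(col_1d.shape)    # (2,)
print(col_2d.shape)    # (2, 1)
```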

#### The Calculation of ndarray Objects

Play with the **argmin(), argmax(), min(), max(), mean(), sum(), std(), dot(), square(), sqrt(), abs(), exp(), sign(), mod()** methods.

Example:

```python
>>> arr = np.array([[1, 2, 3], [3, 4, 6], [7, 8, 9]])
>>> np.square(arr)
array([[ 1,  4,  9],
       [ 9, 16, 36],
       [49, 64, 81]])
```

In [83]:
npList

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8],
       [9, 0]])

In [84]:
npList.argmin() # index of the minimum element in the flattened array

9

In [86]:
npList.argmax() # index of the maximum element in the flattened array

8

In [96]:
b = np.arange(0,10).reshape([2,5])
npList.dot(b)

array([[ 10,  13,  16,  19,  22],
       [ 20,  27,  34,  41,  48],
       [ 30,  41,  52,  63,  74],
       [ 40,  55,  70,  85, 100],
       [  0,   9,  18,  27,  36]])

In [101]:
np.square(npList)

array([[ 1,  4],
       [ 9, 16],
       [25, 36],
       [49, 64],
       [81,  0]])

In [102]:
np.sqrt(npList)

array([[1.        , 1.41421356],
       [1.73205081, 2.        ],
       [2.23606798, 2.44948974],
       [2.64575131, 2.82842712],
       [3.        , 0.        ]])

In [106]:
np.sign(npList) # sign of each element: -1, 0, or 1

array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 0]])

In [107]:
np.mod(npList, 4)

array([[1, 2],
       [3, 0],
       [1, 2],
       [3, 0],
       [1, 0]])
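
Many of these methods also accept an `axis` argument, and the flat index returned by `argmin()` can be converted back to a (row, column) pair. A minimal sketch, rebuilding the same values as `npList` under the name `arr`:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]])

# argmin() without an axis indexes into the flattened array.
flat_idx = int(arr.argmin())
print(flat_idx)  # 9

# unravel_index() converts the flat index back to (row, column).
row, col = (int(i) for i in np.unravel_index(flat_idx, arr.shape))
print(row, col)  # 4 1

# axis=0 reduces down the rows (one result per column);
# axis=1 reduces across the columns (one result per row).
print(arr.sum(axis=0).tolist())  # [25, 20]
print(arr.max(axis=1).tolist())  # [2, 4, 6, 8, 9]
```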

#### Other Important Methods Inside Module Numpy

Play with the arange(), ones(), zeros(), eye(), linspace(), concatenate() methods.

Example:

```python
>>> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
```

In [112]:
np.arange(0,20).reshape([4,5])

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [116]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [118]:
np.eye(4) # 4x4 identity matrix

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [121]:
np.linspace(0,2,10) # 10 evenly spaced numbers from 0 to 2, both endpoints included

array([0.        , 0.22222222, 0.44444444, 0.66666667, 0.88888889,
       1.11111111, 1.33333333, 1.55555556, 1.77777778, 2.        ])

In [125]:
np.concatenate([npList, npList],axis=1) # here we are concatenating two ndarrays horizontally

array([[1, 2, 1, 2],
       [3, 4, 3, 4],
       [5, 6, 5, 6],
       [7, 8, 7, 8],
       [9, 0, 9, 0]])
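
As a small complement to the cells above, this sketch contrasts the two concatenation axes and the different ways `arange()` and `linspace()` treat the end point. The arrays `a` and `b` are made up for illustration.

```python
import numpy as np

a = np.ones((2, 2))
b = np.zeros((2, 2))

# axis=0 stacks vertically (more rows); axis=1 stacks horizontally (more columns).
print(np.concatenate([a, b], axis=0).shape)  # (4, 2)
print(np.concatenate([a, b], axis=1).shape)  # (2, 4)

# arange() takes a step and excludes the stop value.
print(np.arange(0, 1, 0.25).tolist())  # [0.0, 0.25, 0.5, 0.75]

# linspace() takes a count and includes the stop value by default.
print(np.linspace(0, 1, 5).tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
```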

### Scikit-Learn

The following are packages (or methods) in Python (Scikit-Learn and Scipy) that will be used frequently in your programming assignments, so please read carefully.

- Data Preprocessing (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
    - Standardization: StandardScaler
    - Normalization: MinMaxScaler
    - Quantifying Categorical Features: LabelEncoder, OneHotEncoder
    - Construct Train and Test Set: model_selection.train_test_split
- KNN: KNeighborsClassifier
- Linear Regression: LinearRegression
- Logistic Regression: LogisticRegression, LogisticRegressionCV
- Feature Selection / Model Selection
    - L1 Penalized Regression (Lasso Regression) with Cross-Validation: LassoCV
    - L2 Penalized Regression (Ridge Regression) with Cross-Validation: RidgeCV
    - Cross-Validation: StratifiedKFold, RepeatedKFold, LeaveOneOut, KFold, model_selection.cross_validate, model_selection.cross_val_predict, model_selection.cross_val_score
    - Model Metrics (https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics): accuracy_score, auc, f1_score, hamming_loss, precision_score, recall_score, roc_auc_score
- Decision Tree: DecisionTreeClassifier, DecisionTreeRegressor
- Bootstrap, Ensemble Methods
    - Bootstrap: bootstrapped (https://pypi.org/project/bootstrapped/)
    - Bagging: RandomForestClassifier, RandomForestRegressor
    - Boosting: AdaBoostClassifier, AdaBoostRegressor
- Support Vector Machines (https://scikit-learn.org/stable/modules/svm.html#svm): LinearSVC, LinearSVR
- Multiclass and Multilabel Classification (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.multiclass)
    - One-vs-one Multiclass Strategy: OneVsOneClassifier
    - One-vs-the-rest (OvR) Multiclass/Multilabel Strategy: OneVsRestClassifier
- Unsupervised Learning
    - K-means Clustering: KMeans
    - Hierarchical Clustering: scipy.cluster.hierarchy (not scikit-learn)
- Semisupervised Learning (https://scikit-learn.org/stable/modules/label_propagation.html)
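
To give a feel for how a few of the listed pieces fit together, here is a minimal sketch that chains `train_test_split`, `StandardScaler`, `KNeighborsClassifier`, and `accuracy_score`. The choice of the bundled iris dataset, the 25% test split, `random_state=0`, and `n_neighbors=5` are all assumptions made for illustration, not settings required by the assignments.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load a small dataset that ships with scikit-learn (150 samples, 4 features).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Train a KNN classifier and evaluate it on the held-out test set.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_std, y_train)
accuracy = accuracy_score(y_test, knn.predict(X_test_std))
print(f"test accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training data alone avoids leaking information from the test set into the preprocessing step.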

### Git and GitHub

1. In the directory that contains this Jupyter notebook file, initialize a Git repository.
2. Check out a new branch called dev and commit the current notebook on this branch.
3. Merge the dev branch into the master branch (the default branch).
4. Create a temporary repository on GitHub (just for practice; you can delete it later).
5. Push the new changes in the master branch to the remote repository created in step 4.
6. Check out the dev branch again, make some changes to your notebook, and then repeat steps 3 and 5.