CS 252 Day 1, Jasper

-------------------

Topics:
* Greetings, fellow human
* Syllabus
* Tech infrastructure
* Format of data and terminology
* CSV files

__Greetings, Fellow Human__

* What is your name?
* What is the most unusual animal you have ever interacted with?

Jasper

__Syllabus__

[https://cs.colby.edu/srtaylor/courses/S23/cs251/](https://cs.colby.edu/srtaylor/courses/S23/cs251/)

* When are lab sessions and who may attend?
* When and where are Amanda Stent's office hours?
* When and where are the TAs' office hours?
* When is the first lab due?
* When is the first project due?
* What will happen on Friday?

__Tech Infrastructure__

* Python
* Python modules: numpy scipy pandas palettable Pillow matplotlib seaborn jupyter scikit-learn
* Opening a Jupyter notebook
  * [JupyterHub](http://cs251.jupyter.colby.edu/)
  * [GitHub Codespaces](https://github.com/ajstent/CS252S22)
  * [Colab](https://colab.research.google.com)
  * [Visual Studio](https://cs.colby.edu/srtaylor/courses/S23/cs251/software.html)
* [Git](https://github.com)
  * [Lectures](https://github.com/ajstent/CS252S22)
  * [Projects and Labs](https://github.com/ajstent/CS252S22ProjectsLabs)
  * [Cheatsheet](https://training.github.com/downloads/github-git-cheat-sheet.pdf)

Let's make sure we all:
1. Have GitHub accounts
2. Have "forked" the ProjectsLabs repository
3. Know how to add a file to a repository and to commit changes
4. Know how to download a notebook for class

__Format of Data and Terminology__

* What is "data"? 
  * [Data is Plural](https://www.data-is-plural.com/)
* What is "metadata" and how does it relate to data? 
  * [Data Nutrition](https://datanutrition.org/)
* How can we understand data? 
  * [Flowing Data](https://flowingdata.com/)
* Data terms and what they mean:
  * Data point (feature vector)
  * Feature (variable)
    * Independent variable
    * Dependent variable
    * Missing data
  * Feature measures
    * Minimum
    * Maximum
    * Range
    * Precision
    * Accuracy
  * Feature manipulations
    * Scaling
    * Normalizing
  * Multivariate data
    * Dimension
  * Metadata
  
  In the example below, each row of the CSV file is a data point and each column is a feature, so each cell holds one feature value. The last column is the dependent variable (though it needn't be). Some feature values are clearly missing, and the numeric features are read in as text, which makes them hard to measure or manipulate. The first row provides all the metadata there is.
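
Before turning to the CSV example, here is a minimal sketch of the feature measures and manipulations listed above. The values are made up for illustration. "Scaling" is taken here to mean min-max scaling onto [0, 1], and "normalizing" to mean z-scores; this is one common reading, and the course may define the terms differently.

```python
import numpy as np

# Hypothetical values for one numeric feature (not taken from the course data)
overall_average = np.array([3.0, 3.5, 2.5, 4.0, 3.0])

# Feature measures
minimum = overall_average.min()
maximum = overall_average.max()
value_range = maximum - minimum

# Scaling (assumed here to mean min-max scaling): map values onto [0, 1]
scaled = (overall_average - minimum) / value_range

# Normalizing (assumed here to mean z-scores): mean 0, standard deviation 1
normalized = (overall_average - overall_average.mean()) / overall_average.std()

print(minimum, maximum, value_range)
print(scaled)
print(normalized)
```

None of this works until the feature is stored as numbers rather than text, which is the difficulty the CSV example runs into.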


__CSV Files__

* How can you read a CSV file?
* How can you write a CSV file?

In [1]:
import codecs
import csv

# If we read this way the csv reader "fills in" missing values as blanks, which may or may not be great
with codecs.open('data/submissions_summary.csv', 'r', encoding='utf8') as f:
    # what are some of the optional arguments here?
    csvreader = csv.reader(f)
    rows = [x for x in csvreader]
print(len(rows))
print(rows)
print(len(rows[0]))

72
[['submissionNumber', 'submissionType', 'submissionTimestamp', 'numReviews', 'overallMin', 'overallMax', 'overallAverage', 'confidenceMin', 'confidenceMax', 'confidenceAverage', 'ethicsReviews', 'overallMeta', 'comment', 'deskRejectComment', 'numberOfOtherNotes'], ['70', 'cata', '23:35.1', '3', '2', '4', '3', '4', '4', '4', '1', '4', 'I like this one', '[]', '2'], ['20', 'cata', '23:31.3', '3', '2.5', '4', '3.166666667', '4', '4', '4', '0', '3', '', '[]', '6'], ['65', 'cata', '23:34.7', '3', '2.5', '3.5', '3', '2', '5', '2', '1', '4', 'I like this one', '[]', '2'], ['66', 'cata', '23:34.8', '3', '2.5', '3.5', '3.166666667', '2', '4', '2', '0', '4', 'I like this one', '[]', '3'], ['59', 'cata', '23:34.3', '3', '3', '3.5', '3.333333333', '3', '4', '3', '1', '4', '', '[]', '4'], ['61', 'cata', '23:34.4', '3', '2', '4', '3.166666667', '4', '4', '4', '2', '4', '', '[]', '2'], ['35', 'cata', '23:32.4', '3', '3', '4', '3.5', '2', '4', '2', '0', '4', '', '[]', '1'], ['55', 'cata', '23:34.0'
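
One way to explore the question about optional arguments is to try a few on a small in-memory string. The sketch below uses three real `csv.reader` formatting parameters (`delimiter`, `quotechar`, `skipinitialspace`). The semicolon-separated data is hypothetical and only stands in for a file on disk.

```python
import csv
import io

# Hypothetical in-memory data standing in for a file on disk
text = 'id;comment;score\n70; "I like this; a lot";4\n20;;3\n'

reader = csv.reader(
    io.StringIO(text),
    delimiter=';',          # field separator (the default is ',')
    quotechar='"',          # wraps fields that contain the delimiter
    skipinitialspace=True,  # ignore spaces right after a delimiter
)
rows = list(reader)
print(rows)
```

The empty field in the last line comes back as an empty string, which matches the "fills in blanks" behavior noted in the cell above.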

In [2]:
import numpy as np

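# if we read this way, it won't fill in missing values, it will just throw an error
# note: loadtxt splits on whitespace by default, so the header is read as 1 column
# and a row containing "I like this one" is read as 4, hence the error below
# what are some of the optional arguments here?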
data = np.loadtxt('data/submissions_summary.csv', dtype=str)
data

ValueError: the number of columns changed from 1 to 4 at row 2; use `usecols` to select a subset and avoid this error
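
As a hedged sketch of what the optional arguments can do, the example below passes `delimiter=','` to `np.loadtxt` so fields are split on commas rather than whitespace, and uses `np.genfromtxt` to read numeric columns with a missing value. The in-memory strings are made up. Whether the real `submissions_summary.csv` loads cleanly this way is not shown here, since neither function follows CSV quoting rules.

```python
import io
import numpy as np

# Hypothetical two-line CSV with a field that contains spaces
text = 'submissionNumber,comment\n70,I like this one\n'

# delimiter=',' splits on commas instead of the default whitespace
as_text = np.loadtxt(io.StringIO(text), delimiter=',', dtype=str)
print(as_text)

# Hypothetical numeric columns with one missing value
numbers = 'overallMin,overallMax\n2,4\n,3.5\n'

# genfromtxt reads floats by default and turns empty fields into nan
as_floats = np.genfromtxt(io.StringIO(numbers), delimiter=',', skip_header=1)
print(as_floats)
```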

In [3]:
import codecs
import csv
import numpy as np

# if we read this way we get the combination of the good things from both the above, but it's still not perfect
# maybe some of those optional function arguments would help, e.g. to coerce numbers into ints/floats... hmm...
# for your first project you will create a reader that works really well, but you will *not use* numpy to do this.
with codecs.open('data/submissions_summary.csv', 'r', encoding='utf8') as f:
    # what are some of the optional arguments here?
    data = np.array([x for x in csv.reader(f)])
data

array([['submissionNumber', 'submissionType', 'submissionTimestamp', ...,
        'comment', 'deskRejectComment', 'numberOfOtherNotes'],
       ['70', 'cata', '23:35.1', ..., 'I like this one', '[]', '2'],
       ['20', 'cata', '23:31.3', ..., '', '[]', '6'],
       ...,
       ['8', 'cata', '23:30.4', ..., '', '[]', '2'],
       ['1', 'cata', '23:30.0', ..., 'I do not like this one', '[]', '2'],
       ['27', 'catc', '23:31.8', ..., '', 'This submission is invalid.',
        '0']], dtype='<U27')
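
The idea of coercing numbers into ints and floats can be sketched in plain Python, without numpy. The helper name `coerce` and the choice to map empty cells to `None` are assumptions for illustration, not the project's required design. The sample row is adapted from the output above.

```python
def coerce(cell):
    """Return cell as an int or float when possible, None when empty, else the original string."""
    if cell == '':
        return None  # assumption: treat empty cells as missing
    for convert in (int, float):
        try:
            return convert(cell)
        except ValueError:
            pass  # not parseable by this converter; try the next one
    return cell


# A few cells adapted from the output above
row = ['20', 'cata', '23:31.3', '3', '2.5', '3.166666667', '', '[]']
converted = [coerce(cell) for cell in row]
print(converted)
```

A per-cell helper like this ignores what a column is supposed to hold. A reader that decides the type per column, perhaps from the header, is a different design with its own tradeoffs.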

In [4]:
with codecs.open('out.csv', 'w', encoding='utf8') as f:
    # what are some of the optional arguments here?
    csv_writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    csv_writer.writerows(data)
    # what do we notice when we look at this output file?
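
To see what the `quoting` argument does, the following minimal sketch writes the same row with two of the csv module's quoting constants and prints the raw text. The row is hypothetical and mixes strings with a float on purpose.

```python
import csv
import io

# Hypothetical row: two strings and one float
row = ['70', 'cata', 3.5]

outputs = {}
for name, quoting in [('QUOTE_MINIMAL', csv.QUOTE_MINIMAL),
                      ('QUOTE_NONNUMERIC', csv.QUOTE_NONNUMERIC)]:
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting).writerow(row)
    outputs[name] = buffer.getvalue()
    print(name, repr(outputs[name]))
```

With `QUOTE_NONNUMERIC`, anything that is not a Python number is wrapped in quotes, including the string `'70'`. Because `data` in the cell above is an array of strings, it is worth checking how the numeric-looking columns appear in `out.csv`. The csv module documentation also recommends opening files with `newline=''` when using the built-in `open`.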

How does this relate to [project 1](https://cs.colby.edu/srtaylor/courses/S23/cs251/projects/p1datavis/p1datavis251.html)?