# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Introduction to Tabular Data
What are our learning objectives for this lesson?
* Learn about terminology associated with tabular data
* Learn about the steps involved in data preprocessing
* Learn about different attribute types

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
* Review the 2D list practice problem solution on GitHub (it is in the U2 repo in the FileFun folder)
    * Let me know if you have any questions
* Prepare to write some more notes on tabular data

## Today
* Attendance
    * No afternoon question today... we are going to take lots of notes and hopefully start Pandas!
* Announcements
    * IQ3 today on Python basics (through 1D lists and strings)
    * VQ4 is due on Sunday
    * DA2 is due on Monday
    * DA3 is posted (though we will need to cover more of Pandas on Monday)
* Today
    * Finish FileFun
    * Notes on tabular data
    * Start PandasFun

## Tabular Data
Our focus is “Tabular” Data ... aka Relational or Structured
* Data is organized into tables (rows and columns)

Age |Gender |Impressions |Clicks |SignedIn
-|-|-|-|-|
59 |1 |4 |0 |1
19 |0 |5 |0 |1
44 |1 |5 |0 |1
28 |1 |4 |0 |1
61 |1 |10 |1 |1
0 |0 |3 |1 |0

* Each row is an "instance"
    * aka "example", "record", or "object"
* Each column is an “attribute” (of the instance)
    * aka "variables" or "fields"
* A "dataset" is a (sample) set of instances
    * from the "universe of objects" (universe of instances)

This is a sample of (simulated) daily website click stream data (Example from "Doing Data Science", Schutt and O’Neil)
* Each row contains attribute values for one user
* User’s age, gender (0=female, 1=male), ads shown, ads clicked, and if logged in (0=no, 1=yes)
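
Since we have been practicing with 2D lists, here is a minimal sketch of how the table above could be stored in plain Python: one list for the attribute names and one 2D list whose inner lists are the instances. The names `header`, `table`, and `get_column` are illustrative choices, not part of any library.

```python
# Attribute (column) names and instances (rows) from the table above
header = ["Age", "Gender", "Impressions", "Clicks", "SignedIn"]
table = [
    [59, 1, 4, 0, 1],
    [19, 0, 5, 0, 1],
    [44, 1, 5, 0, 1],
    [28, 1, 4, 0, 1],
    [61, 1, 10, 1, 1],
    [0, 0, 3, 1, 0],
]

def get_column(table, header, col_name):
    """Return the values of one attribute (column) as a list."""
    col_index = header.index(col_name)
    return [row[col_index] for row in table]

num_instances = len(table)    # one per row
num_attributes = len(header)  # one per column
clicks = get_column(table, header, "Clicks")

print(num_instances, num_attributes)
print(clicks)
```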

### Keys
An (optional) "key" is one or more attributes with unique values
* The values uniquely identify an instance

For example:

UserId |Age |Gender |Impressions |Clicks |SignedIn
-|-|-|-|-|-|
20 |59 |1 |4 |0 |1
15 |19 |0 |5 |0 |1
31 |44 |1 |5 |0 |1
71 |28 |1 |4 |0 |1
51 |61 |1 |10 |1 |1
60 |0 |0 |3 |1 |0

* here, each UserId value identifies the user

Q: What was the key without UserId? ... A: None (only the row id, i.e., the row's position, identifies an instance)
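
One way to test whether a set of attributes can serve as a key is to check that no combination of their values repeats. The sketch below assumes the table is stored as a header list plus a 2D list; `is_key` is a hypothetical helper name. Keep in mind that uniqueness in a small sample does not guarantee uniqueness in the universe of instances.

```python
header = ["UserId", "Age", "Gender", "Impressions", "Clicks", "SignedIn"]
users = [
    [20, 59, 1, 4, 0, 1],
    [15, 19, 0, 5, 0, 1],
    [31, 44, 1, 5, 0, 1],
    [71, 28, 1, 4, 0, 1],
    [51, 61, 1, 10, 1, 1],
    [60, 0, 0, 3, 1, 0],
]

def is_key(table, header, key_attrs):
    """Return True if the combined values of key_attrs are unique across rows."""
    indexes = [header.index(name) for name in key_attrs]
    seen = set()
    for row in table:
        key_value = tuple(row[i] for i in indexes)
        if key_value in seen:
            return False  # the same value(s) appear in an earlier row
        seen.add(key_value)
    return True

user_id_is_key = is_key(users, header, ["UserId"])
gender_is_key = is_key(users, header, ["Gender"])
print(user_id_is_key, gender_is_key)
```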

### Multiple Attribute Keys

CarName |ModelYear |MSRP
-|-|-
ford pinto |75 |2769
toyota corolla |75 |2711
ford pinto |76 |3025
toyota corolla |76 |2789
... |... |...

Q: What are the key attributes? ... A: {CarName, ModelYear}

Q: Why not just CarName? ... A: Values not unique across rows
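
To see why CarName alone is not enough, we can list the value combinations that occur in more than one row. This is a sketch over the four rows shown above (the elided rows are not included); `duplicate_values` is a hypothetical helper name.

```python
from collections import Counter

header = ["CarName", "ModelYear", "MSRP"]
cars = [
    ["ford pinto", 75, 2769],
    ["toyota corolla", 75, 2711],
    ["ford pinto", 76, 3025],
    ["toyota corolla", 76, 2789],
]

def duplicate_values(table, header, attrs):
    """Return the attribute-value combinations that appear in more than one row."""
    indexes = [header.index(name) for name in attrs]
    counts = Counter(tuple(row[i] for i in indexes) for row in table)
    return sorted(value for value, count in counts.items() if count > 1)

# CarName by itself repeats, so it cannot be the key
name_dups = duplicate_values(cars, header, ["CarName"])
# The pair {CarName, ModelYear} never repeats in these rows
pair_dups = duplicate_values(cars, header, ["CarName", "ModelYear"])

print(name_dups)
print(pair_dups)
```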

### Foreign Keys
A “Foreign Key” is a reference to instances
* typically to instances in another table
* but could be to the same table

SaleId |EmployeeId |CarName |ModelYear |Amt
-|-|-|-|-
555 |12 |ford pinto |75 |3076
556 |12 |toyota corolla |75 |2611
998 |13 |toyota corolla |75 |2800
999 |12 |toyota corolla |76 |2989
... |... |... |... |...


Q: What are the foreign keys (references)?
* {CarName, ModelYear}
* {EmployeeId} for information about the salesperson

Q: What is the key?
* {SaleId}
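
A foreign key is only useful if every reference points at an instance that actually exists. The sketch below checks the {CarName, ModelYear} references in the sales rows shown above against the car table from the previous section. `dangling_references` is a hypothetical helper, and the extra sale row is made up purely to show a failing check.

```python
sales_header = ["SaleId", "EmployeeId", "CarName", "ModelYear", "Amt"]
sales = [
    [555, 12, "ford pinto", 75, 3076],
    [556, 12, "toyota corolla", 75, 2611],
    [998, 13, "toyota corolla", 75, 2800],
    [999, 12, "toyota corolla", 76, 2989],
]
cars_header = ["CarName", "ModelYear", "MSRP"]
cars = [
    ["ford pinto", 75, 2769],
    ["toyota corolla", 75, 2711],
    ["ford pinto", 76, 3025],
    ["toyota corolla", 76, 2789],
]

def dangling_references(table, header, ref_table, ref_header, attrs):
    """Return rows of table whose attrs values match no row in ref_table."""
    ref_idx = [ref_header.index(a) for a in attrs]
    known = {tuple(row[i] for i in ref_idx) for row in ref_table}
    idx = [header.index(a) for a in attrs]
    return [row for row in table if tuple(row[i] for i in idx) not in known]

fk = ["CarName", "ModelYear"]
dangling = dangling_references(sales, sales_header, cars, cars_header, fk)
print(dangling)  # every sale refers to a known car

# Hypothetical row (not in the notes): refers to a car that is not in the car table
bad_sale = [1000, 13, "ford pinto", 77, 3100]
dangling_with_bad = dangling_references(sales + [bad_sale], sales_header, cars, cars_header, fk)
print(dangling_with_bad)
```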

### Join
We can “Join” (combine) two tables on any attributes
* but typically on keys/foreign keys

SaleId |EmployeeId |<mark>CarName</mark> |<mark>ModelYear</mark> |Amt
-|-|-|-|-
555 |12 |ford pinto |75 |3076
556 |12 |toyota corolla |75 |2611
998 |13 |toyota corolla |75 |2800
999 |12 |toyota corolla |76 |2989
... |... |... |... |...

<mark>CarName</mark> |<mark>ModelYear</mark> |MSRP
-|-|-
ford pinto |75 |2769
toyota corolla |75 |2711
ford pinto |76 |3025
toyota corolla |76 |2789
... |... |...


Note that a join returns only the rows that match!
* sometimes we may want to keep non-matches (by “null” padding)
* where a “null” value means a missing value
* we’ll use “NA” to mean null
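
As a rough sketch, a join can be written with nested loops over two 2D lists: for each row in the first table, look for rows in the second table with the same key values. This uses only the rows shown above (not the elided ones), and `inner_join` is a hypothetical helper, not a library function.

```python
sales_header = ["SaleId", "EmployeeId", "CarName", "ModelYear", "Amt"]
sales = [
    [555, 12, "ford pinto", 75, 3076],
    [556, 12, "toyota corolla", 75, 2611],
    [998, 13, "toyota corolla", 75, 2800],
    [999, 12, "toyota corolla", 76, 2989],
]
cars_header = ["CarName", "ModelYear", "MSRP"]
cars = [
    ["ford pinto", 75, 2769],
    ["toyota corolla", 75, 2711],
    ["ford pinto", 76, 3025],
    ["toyota corolla", 76, 2789],
]

def inner_join(left, left_header, right, right_header, key_attrs):
    """Join two tables, keeping only the rows whose key values match."""
    left_idx = [left_header.index(a) for a in key_attrs]
    right_idx = [right_header.index(a) for a in key_attrs]
    # Non-key columns of the right table get appended to each matching left row
    extra_idx = [i for i in range(len(right_header)) if i not in right_idx]
    joined_header = left_header + [right_header[i] for i in extra_idx]
    joined = []
    for left_row in left:
        left_key = [left_row[i] for i in left_idx]
        for right_row in right:
            if left_key == [right_row[i] for i in right_idx]:
                joined.append(left_row + [right_row[i] for i in extra_idx])
    return joined_header, joined

joined_header, joined = inner_join(
    sales, sales_header, cars, cars_header, ["CarName", "ModelYear"]
)
print(joined_header)
for row in joined:
    print(row)
```

The 76 ford pinto has no sale in these rows, so it does not appear in the result.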

A “Full Outer Join” keeps the non-matching rows from both tables

SaleId |EmployeeId |CarName |ModelYear |Amt
-|-|-|-|-
555 |12 |ford pinto |75 |3076
556 |12 |toyota corolla |75 |2611
998 |13 |toyota corolla |75 |2800
999 |12 |<mark>toyota corolla</mark> |<mark>76</mark> |2989

CarName |ModelYear |MSRP
-|-|-
ford pinto |75 |2769
toyota corolla |75 |2711
<mark>ford pinto</mark> |<mark>76</mark> |3025
<mark>toyota corolla</mark> |<mark>77</mark> |2789

Result:

SaleId |EmployeeId |CarName |ModelYear |Amt |MSRP
-|-|-|-|-|-
555 |12 |ford pinto |75 |3076 |2769
556 |12 |toyota corolla |75 |2611 |2711
998 |13 |toyota corolla |75 |2800 |2711
999 |12 |<mark>toyota corolla</mark> |<mark>76</mark> |2989 |NA
NA |NA |<mark>ford pinto</mark> |<mark>76</mark> |NA |3025
NA |NA |<mark>toyota corolla</mark> |<mark>77</mark> |NA |2789

* left outer join = join + rows in first table w/out matches in second
* right outer join = join + rows in second table w/out matches in first
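
Looking ahead to Pandas, the four kinds of join can be sketched with `DataFrame.merge` and its `how` parameter. This assumes pandas is installed; the variable names are illustrative, and the rows are the two example tables from the full outer join above. Pandas marks missing values as `NaN` rather than "NA".

```python
import pandas as pd

sales = pd.DataFrame({
    "SaleId": [555, 556, 998, 999],
    "EmployeeId": [12, 12, 13, 12],
    "CarName": ["ford pinto", "toyota corolla", "toyota corolla", "toyota corolla"],
    "ModelYear": [75, 75, 75, 76],
    "Amt": [3076, 2611, 2800, 2989],
})
prices = pd.DataFrame({
    "CarName": ["ford pinto", "toyota corolla", "ford pinto", "toyota corolla"],
    "ModelYear": [75, 75, 76, 77],
    "MSRP": [2769, 2711, 3025, 2789],
})

keys = ["CarName", "ModelYear"]
inner = sales.merge(prices, on=keys, how="inner")  # matches only
full = sales.merge(prices, on=keys, how="outer")   # matches + non-matches from both
left = sales.merge(prices, on=keys, how="left")    # matches + unmatched sales rows
right = sales.merge(prices, on=keys, how="right")  # matches + unmatched price rows

print(full)
print(len(inner), len(full), len(left), len(right))
```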

Q: How would we join these two tables? What is different?

MPG |Cyls |Displacement | Hrspwr | Wght| Accel |ModelYear| Origin |CarName
-|-|-|-|-|-|-|-|-
23.0 | 4 | 140.0 | 83.0 | 2639 | 17.0 | 75 | 1 | ford pinto
29.0 | 4 | 97.0 | 75.0 | 2171 | 16.0 | 75 | 3 | toyota corolla
... |... |... |... |... |... |... |... |...

CarName|ModelYear |MSRP
-|-|-
ford pinto |75 |2769
toyota corolla |75 |2711
... |... |...

* Join both on their keys {CarName, ModelYear}

## More on Attributes
Different aspects of attributes (variables)
* Data (storage) type - e.g., int versus float versus string
* Measurement scales - are values discrete or continuous
* Semantic type – what the values represent (e.g., colors, ages)

### Measurement Scales
1. Nominal
    * Discrete values without inherent order
    * E.g., colors (red, blue, green), identifiers, occupation, gender
    * Often ints or strings (but could be any data type)
2. Ordinal
    * Discrete values with inherent order
    * E.g., t-shirt size (s, m, l, xl), grades (A, A-, B+, ...)
    * No guarantee that the difference between values is same
    * Often ints or strings (but could be any data type)
3. Interval
    * Values measured on a scale of equal-sized widths
    * Unlike ordinal, can compare and quantify difference between values
    * No inherent zero point (i.e., absence)
    * Temperature (Celsius, Fahrenheit) is an example
4. Ratio
    * Interval values with an inherent zero point
    * Temperature in Kelvin is an example
    * Also counts of things (where 0 means not present)
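
A small sketch of why the scale matters in code. For ordinal values we have to supply the order ourselves, because sorting the strings alphabetically gives the wrong answer. For interval values, differences are meaningful but ratios are not, which shows up when the same temperatures are converted to a ratio scale (Kelvin). The names `SIZE_ORDER`, `size_rank`, and `celsius_to_kelvin` are illustrative.

```python
# Ordinal: the order is inherent to the meaning, not to the strings
SIZE_ORDER = ["s", "m", "l", "xl"]

def size_rank(size):
    """Position of a t-shirt size in the intended order."""
    return SIZE_ORDER.index(size)

sizes = ["xl", "s", "l", "m"]
alphabetical = sorted(sizes)              # string order, not size order
by_size = sorted(sizes, key=size_rank)    # intended order
print(alphabetical, by_size)

# Interval vs ratio: compare 20 and 10 degrees Celsius
def celsius_to_kelvin(celsius):
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

diff_c = warm_c - cool_c    # differences agree on both scales
diff_k = warm_k - cool_k
ratio_c = warm_c / cool_c   # 2.0, but 20 C is not "twice as hot" as 10 C
ratio_k = warm_k / cool_k   # the physically meaningful ratio, close to 1
print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```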
    
### Categorical vs Continuous
* Categorical roughly means the nominal and ordinal values
* Continuous roughly means the rest (interval, ratio) ... aka "numerical"
* For many algorithms/approaches, this is enough detail
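
In practice the distinction decides which summaries make sense. As a minimal sketch using the click stream sample from earlier: Gender is stored as an int, but its codes are just labels, so we count them; Age is numerical, so an average is reasonable.

```python
from collections import Counter
from statistics import mean

# Columns taken from the click stream sample table
genders = [1, 0, 1, 1, 1, 0]      # categorical (0=female, 1=male)
ages = [59, 19, 44, 28, 61, 0]    # continuous / numerical

gender_counts = Counter(genders)  # frequency of each category
mean_age = mean(ages)             # arithmetic average

print(dict(gender_counts))
print(round(mean_age, 2))
```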

### Labeled vs Unlabeled Data
* Labeled data implies an attribute that classifies instances (e.g., mpg)
    * Goal is typically to predict the class for new instances
    * This is called "Supervised Learning"
* Unlabeled means there isn't such an attribute (for mining purposes)
    * Can still find patterns, associations, etc.
    * Generally referred to as "Unsupervised Learning"
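
For labeled data, a common first step is to separate the label attribute from the remaining attributes. The sketch below does this for the two car rows shown earlier, treating MPG as the label as in the example above. `split_label` and the names `X` and `y` are illustrative conventions, not something defined in these notes.

```python
header = ["MPG", "Cyls", "Displacement", "Hrspwr", "Wght",
          "Accel", "ModelYear", "Origin", "CarName"]
table = [
    [23.0, 4, 140.0, 83.0, 2639, 17.0, 75, 1, "ford pinto"],
    [29.0, 4, 97.0, 75.0, 2171, 16.0, 75, 3, "toyota corolla"],
]

def split_label(table, header, label_name):
    """Split a table into (rows without the label column, list of label values)."""
    label_index = header.index(label_name)
    X = [row[:label_index] + row[label_index + 1:] for row in table]
    y = [row[label_index] for row in table]
    return X, y

X, y = split_label(table, header, "MPG")
print(y)
print(X[0])
```

With unlabeled data there is no `y`; we would work with the full table directly.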