# Chapter 1: Introduction to Data Mining

## Into the Digital Era

- People's daily lives
    - 5 billion Internet users
    - 500 million tweets/day
- Scientific discovery
    - LHC: 15 PB/year; LSST: 20 TB/night
- IDC Digital Universe Report
    - 0.8 ZB (2009) $\to$ 35 ZB (2020)
    - 4.4 ZB (2013) $\to$ 44 ZB (2020)

## Why Data Mining?

- Data explosion: KB, MB, GB, TB, PB, EB, ZB, ...
    - data creation, transmission, storage, sharing, processing
    - *We are drowning in data, but starving for knowledge!*
- Need automated analysis of massive data

## What Is Data Mining?

- Data mining (knowledge discovery from data)
    - extraction of interesting patterns or knowledge from huge amounts of data
    - **interesting:** valid, previously unknown, potentially useful, ultimately understandable by humans
    - **huge amounts of data:** scalability, efficiency

## DM Application Areas

- **Science**
    - astrophysics, bioinformatics, drug discovery, sustainable energy, oceanography, seismology, ...
- **Web**
    - search engines, advertising, online social networks, trending, ...
- **Business**
    - market analysis, fraud detection, target marketing, churn prediction, product recommendation, ...
- **Government**
    - surveillance, crime detection, transportation, development, ...

## Data Mining Pipeline

![Image: Data Mining Pipeline](img/1.1.png)

## Data Mining: Various Views

- **Data** view
    - types of data to be mined
- **Method** view
    - types of techniques utilized
- **Knowledge** view
    - types of knowledge to be discovered
- **Application** view
    - types of applications adapted

## Data View

- The 3Vs, 4Vs, and 5Vs

![Image: Data View Vs](img/1.2.png)

- **Database-oriented**
    - relational, transactional
    - data warehouse, NoSQL
- **Sequence, stream, temporal, time-series data**
    - trend analysis, anomaly
- **Spatial, spatial-temporal data**
- **Text, multimedia, Web data**
    - topic detection, similarity, popularity, sentiment
- **Graph, social networks data**
    - substructures, shared interests, influencers, information diffusion

## Knowledge View

- Concept/class description
- Frequent patterns, associations, correlations
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Trend and evolution analysis

### Concept/Class Description

- **Data characterization (summarization)**
    - customers who spend $1000 a year
    - age 40-50, employed, good credit ratings
- **Data discrimination (contrast)**
    - frequent vs. infrequent customers: e.g., age, education, employment
    - dry vs. wet regions: e.g., precipitation, humidity, temperature

### Frequent Patterns

- Frequent **itemsets**
    - e.g., (milk, bread, egg), (beer, diaper)
- Frequent **sequences**
    - e.g., <printer, paper>, <dinner, movie>
- Frequent **structures**

![Image: Frequent Structures](img/1.3.png)
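To make the idea of a frequent itemset concrete, here is a minimal brute-force sketch. The toy transactions, the function name `frequent_itemsets`, and the `min_count` threshold are all assumptions chosen for illustration; practical miners such as Apriori or FP-growth prune the search space rather than enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (illustrative only, not from any real dataset).
transactions = [
    {"milk", "bread", "egg"},
    {"milk", "bread"},
    {"beer", "diaper"},
    {"milk", "bread", "egg", "beer"},
    {"beer", "diaper", "bread"},
]


def frequent_itemsets(transactions, min_count, max_size=3):
    """Count every itemset up to max_size; keep those seen at least min_count times."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)  # sort so each itemset has one canonical tuple form
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_count}


frequent = frequent_itemsets(transactions, min_count=2)
for itemset, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, c)
```

With these toy transactions, `("bread", "milk")` is counted three times and `("beer", "diaper")` twice, so both pass the threshold of 2.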

### Associations

- Association analysis
    - buys (X, milk) $\Rightarrow$ buys (X, bread)
    - [support = 0.5%, confidence = 75%]
- Minimum support (or confidence) threshold
- **Support**
    - probability that A and B appear together
- **Confidence**
    - if A appears, probability that B also appears
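The two measures can be computed directly from transaction counts. The sketch below uses hypothetical toy transactions and hypothetical helper names (`count`, `support`, `confidence`); it assumes the usual definitions, with support as the fraction of transactions containing both A and B, and confidence as the fraction of transactions containing A that also contain B.

```python
# Hypothetical toy transactions (illustrative only).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "egg"},
    {"milk", "bread", "beer"},
    {"milk", "egg"},
    {"beer", "diaper"},
]


def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, fraction also containing the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


s = support(transactions, {"milk"}, {"bread"})     # 3 of 5 transactions
c = confidence(transactions, {"milk"}, {"bread"})  # 3 of the 4 milk transactions
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

A rule is typically reported only when both values clear user-chosen minimum thresholds.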

### Classification

- Finding a model that describes and distinguishes data classes or concepts
- Training data
- IF-THEN rules, decision tree, neural network
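As a small illustration of learning an IF-THEN rule from training data, the sketch below fits a one-feature threshold rule (a decision stump). The training pairs, the two labels, and the names `fit_stump` and `predict` are hypothetical; this is a toy under those assumptions, not a full decision-tree learner.

```python
# Hypothetical training data: (age, buys_product) pairs.
training = [(22, "no"), (25, "no"), (31, "no"), (38, "yes"), (45, "yes"), (52, "yes")]


def fit_stump(data):
    """Find threshold t and label order minimizing training errors for
    the rule: IF x > t THEN `above` ELSE `below`. Assumes exactly two labels."""
    values = sorted({x for x, _ in data})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    first, second = sorted({y for _, y in data})
    best = None
    for t in candidates:
        for above, below in ((first, second), (second, first)):
            errors = sum((above if x > t else below) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, t, above, below)
    return best[1:]  # (threshold, label above, label below)


def predict(rule, x):
    """Apply the learned IF-THEN rule to a new value."""
    t, above, below = rule
    return above if x > t else below


rule = fit_stump(training)
print(f"IF age > {rule[0]} THEN {rule[1]} ELSE {rule[2]}")
print(predict(rule, 40), predict(rule, 30))
```

A decision tree can be seen as a nested set of such rules, each splitting the data that remains after the rules above it.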

### Prediction

- Numerical prediction: continuous-valued instead of class labels
- E.g., weather, stock price, traffic
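Numerical prediction differs from classification only in that the output is a number. A minimal sketch, assuming a single input variable and an ordinary least-squares line (one common choice among many); the data points and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # continuous-valued output for an unseen x
print(slope, intercept, prediction)
```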

### Cluster Analysis

- Class labels unknown
- **Intraclass similarity**
    - maximize, closeness
- **Interclass similarity**
    - minimize, separation
- Hierarchical
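The two objectives can be checked numerically for a given grouping. The sketch below assumes Euclidean distance as the (dis)similarity measure and uses hypothetical 2-D points and function names; a good clustering should show a small mean distance within clusters and a large mean distance between them.

```python
from itertools import combinations
from math import dist

# Hypothetical clustering of 2-D points (illustrative only).
clusters = {
    "a": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}


def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster (closeness)."""
    ds = [dist(p, q) for pts in clusters.values() for p, q in combinations(pts, 2)]
    return sum(ds) / len(ds)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters (separation)."""
    ds = [
        dist(p, q)
        for ps, qs in combinations(clusters.values(), 2)
        for p in ps
        for q in qs
    ]
    return sum(ds) / len(ds)


intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```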

### Outlier Analysis

- Outliers
    - do not comply with the general model
- Noise or exception
- Fraud detection, rare event analysis
- E.g., credit fraud analysis
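One simple way to flag values that do not comply with the general model is a z-score test. The sketch below assumes the "general model" is just the mean and standard deviation of the data, and uses a hypothetical threshold of 2.0 and hypothetical transaction amounts; real fraud detection relies on far richer models.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]


def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]


outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or an exception worth investigating depends on the application.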

### Trend and Evolution Analysis

- Trends, deviations
- Sequential pattern mining
    - e.g., traffic congestion
- Periodicity analysis
- E.g., music, applications, ...
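A trend can be separated from short-term fluctuation with a moving average. This is a minimal sketch, assuming a fixed window of 3 and a hypothetical series; the name `moving_average` is illustrative.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical measurements (e.g., daily congestion levels).
series = [3, 5, 4, 6, 8, 7, 9]
trend = moving_average(series, window=3)
print(trend)
```

The smoothed values rise steadily even though the raw series moves up and down, which is the upward trend; the gap between a raw value and the smoothed value at the same position is its deviation.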

### Market Analysis/Management

- Data sources: credit card transactions, club cards, customer calls, ...
- What types of customers buy what products
- What factors attract new customers
- Target marketing, product recommendation, discount
- Fraud detection

## Are the Patterns Interesting?

- Interesting pattern
    - **valid** on new/test data with some certainty
    - **novel**
    - potentially **useful**
    - ultimately **understandable** by humans
- Objective measures
    - e.g., support, confidence, false positive/negative, accuracy
- Subjective measures
- Completeness, exclusiveness
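The objective measures for a classifier can be computed from counts of correct and incorrect predictions on test data. The labels below are hypothetical (1 marks the positive class, e.g., fraud), and the function name `confusion_counts` is illustrative.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1  # false positive: flagged, but actually negative
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1  # false negative: missed an actual positive
    return tp, fp, tn, fn


# Hypothetical test labels and model predictions.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
print(tp, fp, tn, fn, accuracy)
```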

## Major Issues in Data Mining

- **Mining technology**
    - mining different kinds of knowledge from diverse data (possibly noisy or incomplete)
    - pattern evaluation: interestingness
    - efficiency, effectiveness, scalability
    - parallel, distributed, incremental mining
    - incorporation of background knowledge
    - integration of discovered knowledge with existing knowledge
- **User interaction**
    - data mining query languages, ad-hoc mining
    - expression and visualization of results
    - interactive mining at multiple granularities
- **Applications and social impacts**
    - domain-specific data mining
    - applications of data mining results
    - protection of data security, integrity, and privacy

## Data Science Ethics

- Data ownership
- Privacy, anonymity
- Data and model validity
- Data and model bias (algorithmic fairness)
- Interpretation, application, societal consequence

## Data Mining Resources

- ACM SIGKDD: [https://www.kdd.org/](https://www.kdd.org/)
- Conferences
    - **KDD: tutorials, research, applied data science, KDD Cup, sponsors**
    - SDM, ICDM, WSDM, CIKM, ICDE, The Web Conference (formerly WWW), SIGIR, ICML, CVPR, NeurIPS (formerly NIPS), SIGMOD, VLDB, ...
- Journals
    - TKDE, TKDD, DMKD, TPAMI, ...