## Initializing Spark

In [1]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('transformation_1')
sc = SparkContext.getOrCreate(conf=conf)

- Load a file from the data folder
- Create `map` and `flatMap` transformations

## Applying Transformations
### Map
```python
map(func)
```
Return a new distributed dataset formed by passing each element of the source through the function `func`.

### FlatMap

```python
flatMap(func)
```
Similar to `map`, but each input item can be mapped to 0 or more output items, so `func` should return an iterable (such as a list) rather than a single item.
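The difference between the two is easiest to see on a tiny input. The following is a minimal plain-Python sketch, not Spark code: the `lines` list and the `split_words` helper are hypothetical stand-ins, and list comprehensions play the role of the RDD operations.

```python
from itertools import chain

# Hypothetical input lines (not the contents of data2.txt)
lines = ["hello world", "hello hello"]

def split_words(line):
    """Split a line on single spaces, as the lambda in this notebook does."""
    return line.split(' ')

# map analogue: exactly one output element per input element,
# so the result is a list of lists
mapped = [split_words(line) for line in lines]

# flatMap analogue: the iterable returned for each element is
# flattened into a single sequence of words
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

print(mapped)       # [['hello', 'world'], ['hello', 'hello']]
print(flat_mapped)  # ['hello', 'world', 'hello', 'hello']
```

The nesting in `mapped` is what `map` preserves and `flatMap` removes.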

In [7]:
textFile = sc.textFile('./data/data2.txt')
lines_map = textFile.map(lambda line: line.split(' '))
lines_flatmap = textFile.flatMap(lambda line: line.split(' '))

## Applying an Action to Get the Result of the Transformations Above
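Spark transformations are lazy: they only describe a computation, and nothing runs until an action asks for a result. A rough plain-Python analogy, assuming Python 3's lazy built-in `map` stands in for a transformation and `list()` stands in for the action, might look like this. The `tag` function and the `words` data are hypothetical.

```python
evaluated = []  # records which elements have actually been processed

def tag(word):
    """Pair a word with the count 1 and note that it was evaluated."""
    evaluated.append(word)
    return (word, 1)

words = ['hello', 'world', 'hello']  # hypothetical data

pairs = map(tag, words)          # lazy: only describes the computation
calls_before = len(evaluated)    # nothing has been evaluated yet

result = list(pairs)             # forcing the iterator plays the role of an action
calls_after = len(evaluated)

print(calls_before, calls_after)  # 0 3
print(result)                     # [('hello', 1), ('world', 1), ('hello', 1)]
```

This is only an analogy: a Python iterator is consumed once, whereas an RDD can be evaluated again by each new action.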

### Collect
```python
collect()
```
Return all the elements of the dataset as a list in the driver program. Because the entire result must fit in the driver's memory, this is usually useful only after a filter or other operation that returns a sufficiently small subset of the data.

In [8]:
lines_map.collect()

[['hello', 'world', 'hello', 'hello']]

In [9]:
lines_flatmap.collect()

['hello', 'world', 'hello', 'hello']

In [10]:
countLines = lines_flatmap.map(lambda x: (x, 1))

In [11]:
countLines.collect()

[('hello', 1), ('world', 1), ('hello', 1), ('hello', 1)]

## reduceByKey

Note: `reduceByKey` is a transformation, not an action.

```python
reduceByKey(func, [numPartitions])
```
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function `func`, which must be of type (V, V) => V. As with `groupByKey`, the number of reduce tasks is configurable through the optional second argument.
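The per-key aggregation described above can be sketched in plain Python. This is an illustration of the semantics only, not how Spark implements it: `reduce_by_key` is a hypothetical helper, and the input pairs are the ones produced earlier in this notebook.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Hypothetical helper: group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # func takes two values of type V and returns one value of type V
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [('hello', 1), ('world', 1), ('hello', 1), ('hello', 1)]
totals = reduce_by_key(pairs, lambda x, y: x + y)

print(totals)  # {'hello': 3, 'world': 1}
```

In Spark the values for a key are combined in an order that depends on partitioning, so `func` should be associative and commutative. Addition satisfies both.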

In [12]:
totalWordCount = countLines.reduceByKey(lambda x, y: x+y)

In [13]:
totalWordCount.collect()

[('world', 1), ('hello', 3)]