
# Partitioning Data


In [1]:
# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# extract the Spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# point the JAVA_HOME and SPARK_HOME environment variables at the installed locations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark

In [2]:
import findspark
findspark.init()

In [3]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, Row

from pyspark.sql import SQLContext


In [4]:
# local[5] runs Spark locally with 5 worker threads, which also sets the default number of partitions to 5
spark = SparkSession.builder.appName("Particionado")\
  .master("local[5]").getOrCreate()

In [5]:
df = spark.range(0,20)
df.rdd.getNumPartitions()

5

In [6]:
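# range(0, 20) yields the 20 values 0..19; the tuple (0, 20) would hold only the two values 0 and 20
rdd1 = spark.sparkContext.parallelize(range(0, 20), 10)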
rdd1.getNumPartitions()

10

The RDD has 10 partitions, matching the second argument passed to `parallelize`.
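
The partition count comes from the second argument, not from the amount of data. As a rough pure-Python sketch (the helper `slice_evenly` is hypothetical, and Spark's actual placement of elements may differ), contiguous even slicing looks like this:

```python
def slice_evenly(items, num_slices):
    """Split a sequence into num_slices contiguous chunks of near-equal size."""
    items = list(items)
    size = len(items)
    # Boundary i sits at floor(i * size / num_slices), so chunk sizes differ by at most 1.
    bounds = [(i * size) // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]

# 20 elements in 10 slices: every slice holds 2 elements.
partitions = slice_evenly(range(0, 20), 10)
print(len(partitions))  # 10
print(partitions[0])    # [0, 1]

# A 2-element tuple in 10 slices: still 10 slices, but 8 of them are empty.
tiny = slice_evenly((0, 20), 10)
print(len(tiny))                          # 10
print(sum(1 for part in tiny if part))    # 2
```

To see the real layout in Spark, `rdd1.glom().collect()` returns the contents of each partition as a list.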

In [7]:
# The second argument is the minimum number of partitions to request
rddDesdeArchivo = spark.sparkContext.textFile("deporte.csv", 10)

In [8]:
rddDesdeArchivo.getNumPartitions()

10
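
For `textFile`, the second argument is a minimum, and the file is cut by byte ranges rather than by line count. The sketch below is loosely modelled on the split-size arithmetic in Hadoop's `FileInputFormat`. The function name `compute_splits`, the 700- and 709-byte file sizes, and the 32 MB block size are illustrative assumptions, not measurements of `deporte.csv`.

```python
def compute_splits(total_size, num_splits, block_size=32 * 1024 * 1024,
                   min_size=1, slop=1.1):
    """Return (offset, length) byte ranges for a file of total_size bytes."""
    goal_size = total_size // max(num_splits, 1)
    split_size = max(min_size, min(goal_size, block_size))
    splits = []
    offset = 0
    remaining = total_size
    # Keep cutting full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > slop:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    # Whatever is left becomes the final (possibly short or slightly long) split.
    if remaining > 0:
        splits.append((offset, remaining))
    return splits

print(len(compute_splits(700, 10)))  # 10
print(len(compute_splits(709, 10)))  # 11
print(compute_splits(709, 10)[-1])   # (700, 9)
```

Under these assumptions the result can exceed the requested count, and because the cuts are byte offsets, partitions need not hold the same number of lines.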

Next, we look at how to save the RDDs (or DataFrames) we have created. For an RDD, the method is `saveAsTextFile`:

In [9]:
rddDesdeArchivo.saveAsTextFile("/content/salidatexto")

The output directory holds our 10 partitions, one file each:

In [10]:
!ls /content/salidatexto

part-00000  part-00002	part-00004  part-00006	part-00008  _SUCCESS
part-00001  part-00003	part-00005  part-00007	part-00009
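
The layout above (one zero-padded `part-NNNNN` file per partition plus a `_SUCCESS` marker) can be imitated in plain Python. This is only a sketch of the naming convention: `save_partitions` is a hypothetical helper, and the sample rows are taken from the file contents shown in this notebook.

```python
import os
import tempfile

def save_partitions(partitions, out_dir):
    """Write each partition to its own part file, then add a _SUCCESS marker."""
    os.makedirs(out_dir)  # raises FileExistsError if out_dir already exists
    for index, lines in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{index:05d}")
        with open(path, "w") as handle:
            handle.writelines(line + "\n" for line in lines)
    # Empty marker file signalling that the whole write finished.
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()

base = tempfile.mkdtemp()
out_dir = os.path.join(base, "salidatexto")
save_partitions([["deporte_id,deporte", "1,Basketball"], ["2,Judo"]], out_dir)
print(sorted(os.listdir(out_dir)))  # ['_SUCCESS', 'part-00000', 'part-00001']
```

Like this sketch, `saveAsTextFile` fails when the output directory already exists, so re-running the save cell requires deleting `/content/salidatexto` first or choosing a new path.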


In [12]:
!head -n 5 /content/salidatexto/part-00000

deporte_id,deporte
1,Basketball
2,Judo
3,Football
4,Tug-Of-War


In [13]:
rdd = spark.sparkContext.wholeTextFiles("/content/salidatexto/*")

In [14]:
rdd.take(4)

[('file:/content/salidatexto/part-00000',
  'deporte_id,deporte\n1,Basketball\n2,Judo\n3,Football\n4,Tug-Of-War\n5,Speed Skating\n6,Cross Country Skiing\n'),
 ('file:/content/salidatexto/part-00001',
  '7,Athletics\n8,Ice Hockey\n9,Swimming\n10,Badminton\n11,Sailing\n12,Biathlon\n13,Gymnastics\n14,Art Competitions\n'),
 ('file:/content/salidatexto/part-00002',
  '15,Alpine Skiing\n16,Handball\n17,Weightlifting\n18,Wrestling\n19,Luge\n20,Water Polo\n'),
 ('file:/content/salidatexto/part-00003',
  '21,Hockey\n22,Rowing\n23,Bobsleigh\n24,Fencing\n25,Equestrianism\n26,Shooting\n27,Boxing\n28,Taekwondo\n')]

In [15]:
# split() with no arguments splits on any whitespace, not only on newlines
lista = rdd.mapValues(lambda x: x.split()).collect()

In [16]:
lista

[('file:/content/salidatexto/part-00000',
  ['deporte_id,deporte',
   '1,Basketball',
   '2,Judo',
   '3,Football',
   '4,Tug-Of-War',
   '5,Speed',
   'Skating',
   '6,Cross',
   'Country',
   'Skiing']),
 ('file:/content/salidatexto/part-00001',
  ['7,Athletics',
   '8,Ice',
   'Hockey',
   '9,Swimming',
   '10,Badminton',
   '11,Sailing',
   '12,Biathlon',
   '13,Gymnastics',
   '14,Art',
   'Competitions']),
 ('file:/content/salidatexto/part-00002',
  ['15,Alpine',
   'Skiing',
   '16,Handball',
   '17,Weightlifting',
   '18,Wrestling',
   '19,Luge',
   '20,Water',
   'Polo']),
 ('file:/content/salidatexto/part-00003',
  ['21,Hockey',
   '22,Rowing',
   '23,Bobsleigh',
   '24,Fencing',
   '25,Equestrianism',
   '26,Shooting',
   '27,Boxing',
   '28,Taekwondo']),
 ('file:/content/salidatexto/part-00004',
  ['29,Cycling',
   '30,Diving',
   '31,Canoeing',
   '32,Tennis',
   '33,Modern',
   'Pentathlon',
   '34,Figure',
   'Skating',
   '35,Golf']),
 ('file:/content/salidatexto/part-0
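
In the output above, multi-word names such as `Speed Skating` are broken into separate tokens because `split()` with no arguments cuts on spaces as well as newlines. A small comparison, using a shortened sample of the content of `part-00000`:

```python
content = "4,Tug-Of-War\n5,Speed Skating\n6,Cross Country Skiing\n"

# split() cuts on every run of whitespace, including the spaces inside names.
by_whitespace = content.split()
print(by_whitespace)
# ['4,Tug-Of-War', '5,Speed', 'Skating', '6,Cross', 'Country', 'Skiing']

# splitlines() cuts only at line boundaries, so each record stays whole.
by_line = content.splitlines()
print(by_line)
# ['4,Tug-Of-War', '5,Speed Skating', '6,Cross Country Skiing']
```

If whole records are wanted, `rdd.mapValues(lambda x: x.splitlines())` is the drop-in change. Here only the file paths are used afterwards, so the result of the next cells is unaffected.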

In [17]:
# Keep only the file paths (the first element of each pair) and sort them
lista = [l[0] for l in lista]
lista.sort()

In [18]:
lista

['file:/content/salidatexto/part-00000',
 'file:/content/salidatexto/part-00001',
 'file:/content/salidatexto/part-00002',
 'file:/content/salidatexto/part-00003',
 'file:/content/salidatexto/part-00004',
 'file:/content/salidatexto/part-00005',
 'file:/content/salidatexto/part-00006',
 'file:/content/salidatexto/part-00007',
 'file:/content/salidatexto/part-00008',
 'file:/content/salidatexto/part-00009']
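
A plain string sort puts the files in partition order only because the part numbers are zero-padded. A quick illustration with made-up indices:

```python
indices = (10, 2, 0)

padded = sorted(f"part-{i:05d}" for i in indices)
unpadded = sorted(f"part-{i}" for i in indices)

print(padded)    # ['part-00000', 'part-00002', 'part-00010']
print(unpadded)  # ['part-0', 'part-10', 'part-2']
```

Without the padding, `part-10` would sort before `part-2`, and the files would be read back out of order.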

In [19]:
# textFile accepts a comma-separated list of paths; each line is then split into its fields
rdd2 = spark.sparkContext.textFile(",".join(lista), 10).map(lambda l: l.split(","))

In [21]:
rdd2.take(4)

[['deporte_id', 'deporte'],
 ['1', 'Basketball'],
 ['2', 'Judo'],
 ['3', 'Football']]
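
Two things are worth noting in this result: the header arrives as an ordinary row, and `l.split(",")` would mis-split any field that contains a quoted comma. A minimal sketch of a more careful parser using the standard `csv` module follows. The helper `parse_line` and the third sample line are hypothetical (no such row appears in the data shown here).

```python
import csv

def parse_line(line):
    """Parse one CSV line, honouring quoted fields that contain commas."""
    return next(csv.reader([line]))

lines = ["deporte_id,deporte", "1,Basketball", '2,"Judo, Kodokan"']

print(lines[2].split(","))   # ['2', '"Judo', ' Kodokan"']
print(parse_line(lines[2]))  # ['2', 'Judo, Kodokan']

# Treat the first row as the header and pair it with every data row.
header = parse_line(lines[0])
records = [dict(zip(header, parse_line(line))) for line in lines[1:]]
print(records[0])  # {'deporte_id': '1', 'deporte': 'Basketball'}
```

A function like `parse_line` could be passed to `map` in place of the lambda, and the header row filtered out before further processing.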