### pandas DataFrame vs. Spark DataFrame

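How they run  
pandas: a single-machine tool with no built-in parallelism and no native Hadoop support, so large datasets become a bottleneck  
spark: a distributed parallel computing framework with built-in parallelism; data and operations are automatically distributed across the cluster nodes, so distributed data is handled the same way as in-memory data. Works with Hadoop and can process large volumes of data  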

Lazy evaluation  
pandas: not lazy-evaluated (each operation runs immediately)  
spark: lazy-evaluated (transformations run only when an action is called)  
    
Mutability  
pandas: DataFrames are mutable  
spark: RDDs are immutable, so DataFrames are immutable as well  
    
Creation  
pandas: pandasDF = sparkDF.toPandas()  
spark: sparkDF = sqlContext.createDataFrame(pandasDF)  
    
Index  
pandas: created automatically  
spark: no index  
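
The pandas side of this comparison can be checked in a few lines. The sketch below uses only pandas and made-up sample data (the names `df` and `height_m` are illustrative and do not come from the notebook). The Spark counterparts appear only as comments, since running them needs a live Spark session.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "height": [80, 85]})

# Automatic index: pandas attaches a RangeIndex (0, 1, ...) when none is given.
index_labels = list(df.index)

# Eager evaluation: the new column is computed as soon as it is assigned.
# Mutability: the assignment modifies df in place; no new object is created.
id_before = id(df)
df["height_m"] = df["height"] / 100
same_object = id(df) == id_before

print(index_labels)   # [0, 1]
print(same_object)    # True
print(df)

# Spark counterpart (not run here): sparkDF.withColumn("height_m", ...) returns
# a new DataFrame and leaves sparkDF unchanged, and the work is deferred until
# an action such as show() or count() is called.
```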

In [4]:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext 
from pyspark import StorageLevel
from pyspark.sql import Row

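# Stop the SparkContext the notebook kernel already provides (assumed to exist as `sc`) before creating a new one
sc.stop()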
sc = SparkContext()
sqlContext = HiveContext(sc)

In [21]:
# way 1
pandasDF = pd.DataFrame({"name": ["Alice", "Bob", "Cycy", "Cycy"], "height": [80, 80, 80, 80], "age": [5, 5, 10, 10]})
print(pandasDF)

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80)])
sparkDF = rdd.toDF()
sparkDF.show()

# way 2
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
pandasDF2 = pd.DataFrame(l, columns=["name", "age"])
print(pandasDF2)

rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
sparkDF2 = sqlContext.createDataFrame(people)
sparkDF2.show()

# Conversion between the two
pdf = sparkDF.toPandas()
print(pdf)

spdf = sqlContext.createDataFrame(pandasDF)
spdf.show()

    name  height  age
0  Alice      80    5
1    Bob      80    5
2   Cycy      80   10
3   Cycy      80   10
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
|  5|    80|  Bob|
| 10|    80| Cycy|
| 10|    80| Cycy|
+---+------+-----+

       name  age
0     Ankit   25
1  Jalfaizy   22
2   saurabh   20
3      Bala   26
+---+--------+
|age|    name|
+---+--------+
| 25|   Ankit|
| 22|Jalfaizy|
| 20| saurabh|
| 26|    Bala|
+---+--------+

   age  height   name
0    5      80  Alice
1    5      80    Bob
2   10      80   Cycy
3   10      80   Cycy
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80|  5|
|  Bob|    80|  5|
| Cycy|    80| 10|
| Cycy|    80| 10|
+-----+------+---+
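
In the output above, the frame returned by `toPandas()` has its columns in the order `age, height, name`, while the original `pandasDF` used `name, height, age`. One likely cause is that, in Spark versions before 3.0, a `Row` built from keyword arguments sorts its fields by name. The pandas-only sketch below shows one way the two frames could be compared despite the different order. `original` and `round_tripped` are hypothetical stand-ins built by hand, not the notebook's actual objects.

```python
import pandas as pd

# Stand-in for the frame as first created (column order: name, height, age)
original = pd.DataFrame({"name": ["Alice", "Bob"], "height": [80, 80], "age": [5, 5]})

# Stand-in for the frame after a Spark round trip (columns sorted by name)
round_tripped = pd.DataFrame({"age": [5, 5], "height": [80, 80], "name": ["Alice", "Bob"]})

# DataFrame.equals compares the column labels in order, so this is False
equal_as_is = original.equals(round_tripped)

# Select the columns in the original order before comparing
aligned = round_tripped[original.columns]
equal_after_reorder = original.equals(aligned)

print(equal_as_is)          # False
print(equal_after_reorder)  # True
```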



In [75]:
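# Selecting values
# Note: integer column labels work here only because the columns were renamed to 0, 1, 2 in cell In [70], which was executed earlier
print(pdf[0])      # a column; same as pdf.loc[:, 0]; several columns: pdf.loc[:, [0, 1]]
print(pdf.loc[0])  # a row by label; pdf.iloc[0] selects it by position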

spdf.show()

0     5
1     5
2    10
3    10
Name: 0, dtype: int64
0        5
1       80
2    Alice
Name: 0, dtype: object
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80|  5|
|  Bob|    80|  5|
| Cycy|    80| 10|
| Cycy|    80| 10|
+-----+------+---+
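
With the default index, `loc[0]` and `iloc[0]` return the same row, which can hide the difference between them: `loc` selects by label and `iloc` selects by position. The sketch below uses a made-up frame whose index labels are deliberately not 0, 1, 2 to make that visible (all names here are illustrative).

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Cycy"], "age": [5, 5, 10]},
    index=[10, 20, 30],
)

by_position = df.iloc[0]             # first row, whatever its label is
by_label = df.loc[10]                # row whose index label is 10
ages = df.loc[:, "age"]              # one column, selected by label
first_two_cols = df.iloc[:, [0, 1]]  # columns selected by position

# Label 0 does not exist in this index, so loc raises KeyError
missing_label = False
try:
    df.loc[0]
except KeyError:
    missing_label = True

print(by_position["name"])  # Alice
print(by_label["name"])     # Alice
print(missing_label)        # True
```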



In [70]:
aa = pdf
aa.columns = [0, 1, 2]
aa.loc[:,[0,1]]

    0   1
0   5  80
1   5  80
2  10  80
3  10  80
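
Note that `aa = pdf` does not copy anything: both names refer to the same DataFrame, so assigning to `aa.columns` also renames the columns of `pdf`. This appears to be why `pdf[0]` works in cell `In [75]`. If an independent frame is wanted, `DataFrame.copy()` can be used instead. A minimal sketch with made-up data (`pdf_demo`, `alias`, and `independent` are illustrative names):

```python
import pandas as pd

pdf_demo = pd.DataFrame({"age": [5, 10], "height": [80, 80]})

alias = pdf_demo                # same object, no data copied
independent = pdf_demo.copy()   # separate object with its own column labels

# Renaming through the alias changes pdf_demo too, but not the copy
alias.columns = [0, 1]

print(alias is pdf_demo)          # True
print(list(pdf_demo.columns))     # [0, 1]
print(list(independent.columns))  # ['age', 'height']
```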
