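# Image preprocessing to generate the spatial topology
- Select the labeled images and draw three subsets of 5000, 10000, and 50000 images
- Copy the images into the corresponding folders
- Generate the three maps: pspnet, maskrcnn, and saliency
- Fuse the maps, locate the object centers, then compute and save the per-image information
- Update and validate the probabilities according to the image distribution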

## 1 Select the subsets (drop the images without labels)

In [1]:
import os
import cv2
import random
import shutil
import pandas as pd

### 1.1 Set the subset sizes and load the image list

In [2]:
# Sizes of the three subsets
N1 = 5000
N2 = 10000
N3 = 50000

# Load the full dataset list into a DataFrame
ava_path = "/home/flyingbird/Flyingbird/AVA/AVA_dataset/AVA_with_segs_scores_aesthetic.txt"
df = pd.read_csv(ava_path, sep=' ')

In [3]:
df.head(5)

Unnamed: 0,index,image_id,seg_1,seg_2,ave_scores,aesthetic_image
0,1,953619,Abstract,Macro,5.637097,1
1,2,953958,Abstract,Black_and_White,4.698413,0
2,3,954184,0,0,5.674603,1
3,4,954113,Nature,Black_and_White,5.773438,1
4,5,953980,Macro,Floral,5.209302,1


### 1.2 Keep only the labeled images

In [4]:
# Select the rows of images without labels (both label columns hold the string '0')
no_label_df = df[(df['seg_1'] == '0')&(df['seg_2'] == '0')]

# Extract their image_id values
df_id = no_label_df['image_id']

# Keep the labeled images: negate the membership mask with ~
df_have_label = df[~df['image_id'].isin(df_id)]

In [5]:
df_have_label.head(5)

Unnamed: 0,index,image_id,seg_1,seg_2,ave_scores,aesthetic_image
0,1,953619,Abstract,Macro,5.637097,1
1,2,953958,Abstract,Black_and_White,4.698413,0
3,4,954113,Nature,Black_and_White,5.773438,1
4,5,953980,Macro,Floral,5.209302,1
5,6,954175,Nature,Insects_etc,5.6,1
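The filter in 1.2 can be tried on a small hand-made frame. This is only a minimal sketch: the `toy` frame and its values are invented for illustration, and it assumes, as in the cell above, that a missing label is stored as the string `'0'`.

```python
import pandas as pd

# Hypothetical toy data with the same label columns as the AVA list
toy = pd.DataFrame({
    'image_id': [101, 102, 103, 104],
    'seg_1': ['Abstract', '0', 'Nature', '0'],
    'seg_2': ['Macro', '0', '0', 'Floral'],
})

# A row counts as unlabeled only when both label columns hold the string '0'
no_label = (toy['seg_1'] == '0') & (toy['seg_2'] == '0')

# Negate the mask with ~ to keep every row that has at least one label
toy_have_label = toy[~no_label]
print(toy_have_label['image_id'].tolist())
```

Only image 102 is dropped here, because rows with a single missing label (103 and 104) still count as labeled.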


### 1.3 Drop the rows whose images cannot be found in the image folder

In [6]:
# Disabled check, kept for reference: collect the image_ids in df_have_label
# whose .jpg file is missing from the 'images' folder.
# list_label = set(os.listdir('images'))  # set: constant-time membership tests
# special_image_id = []
# count = 0
#
# for i in df_have_label['image_id'].values:
#     str_image = str(i) + '.jpg'
#     if str_image not in list_label:
#         special_image_id.append(str_image)
#     count += 1
#     if count % 1000 == 0:
#         print('%d images checked' % count)
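The existence check from 1.3 could be packaged as a small helper. The sketch below is an illustration rather than the notebook's own code: `find_missing_images` is a hypothetical name, and the demo runs on a temporary folder with empty placeholder files instead of the real `images` directory.

```python
import os
import tempfile

def find_missing_images(image_ids, image_dir):
    """Return the file names '<image_id>.jpg' that are absent from image_dir."""
    # A set makes each membership test O(1); a list would be scanned every time
    existing = set(os.listdir(image_dir))
    return [f'{i}.jpg' for i in image_ids if f'{i}.jpg' not in existing]

# Demo on a temporary folder holding two empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ('1.jpg', '2.jpg'):
        open(os.path.join(tmp, name), 'w').close()
    missing = find_missing_images([1, 2, 3], tmp)

print(missing)
```

Listing the folder once and testing against a set keeps the check fast even when the label list has hundreds of thousands of rows.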

In [7]:
# Drop the rows whose images cannot be found in the original data (953841 397289 953619 953958)
#list_error = [953841, 397289, 953619, 953958, 310261, 398594, 848725, 148477, 11066]
list_error = [954113, 953980, 954175, 953349, 953897, 444892, 567829, 638163, 104855, 430454, 148477, 953619,
              953841, 397289, 953619, 953958, 310261, 398594, 848725, 148477, 11066]

# Filter with a boolean mask and reassign; dropping in place on a slice of df
# raises pandas' SettingWithCopyWarning
df_have_label = df_have_label[~df_have_label['image_id'].isin(list_error)]
  


### 1.4 Randomly sample the required number of images and save them as txt files

In [8]:
# Randomly sample N images and save the list (fixed random_state for reproducibility)
def save_image(save_file_name, N):
    total_set = df_have_label.sample(n=N, random_state=66)
    total_set.to_csv(save_file_name, sep=' ', index=False)

In [9]:
save_image('./image_label_txt/total_image_5000', N1)
save_image('./image_label_txt/total_image_10000', N2)
save_image('./image_label_txt/total_image_50000', N3)
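The sampling and saving step can be checked on a toy frame. This is a sketch under stated assumptions: `toy` is an invented stand-in for `df_have_label`, and an in-memory buffer replaces the txt file so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for df_have_label
toy = pd.DataFrame({'image_id': range(100, 120), 'ave_scores': [5.0] * 20})

# The same random_state always selects the same rows
first = toy.sample(n=5, random_state=66)
second = toy.sample(n=5, random_state=66)
same_rows = first.equals(second)

# Write with a space separator and no index column, then read it back
buffer = io.StringIO()
first.to_csv(buffer, sep=' ', index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep=' ')
print(same_rows, restored['image_id'].tolist() == first['image_id'].tolist())
```

Note that `sample` draws without replacement by default, so it raises a `ValueError` when `n` exceeds the number of rows; the largest subset size therefore has to fit within the labeled set.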

## 2 Copy the images into the corresponding folders

### 2.1 Set the txt list paths, the image path, and the copy destination paths

In [10]:
# '/home/flyingbird/Flyingbird/Test/images'
main_root = '/home/flyingbird/Flyingbird/image_pre_process'
image_file = main_root + '/images'
source_path1 = main_root + '/image_label_txt/total_image_5000'
source_path2 = main_root + '/image_label_txt/total_image_10000'
source_path3 = main_root + '/image_label_txt/total_image_50000'

save_path1 = main_root + '/original_image/5000'
save_path2 = main_root + '/original_image/10000'
save_path3 = main_root + '/original_image/50000'
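When the destination of `shutil.copy` is not an existing directory, it is treated as a file name, so it can help to create the destination folders before copying. The sketch below assumes the same folder layout as above but uses a temporary directory as a stand-in for `main_root`, so it runs anywhere.

```python
import os
import tempfile

# Temporary stand-in for main_root so the sketch runs anywhere
with tempfile.TemporaryDirectory() as root:
    save_paths = [os.path.join(root, 'original_image', str(n))
                  for n in (5000, 10000, 50000)]
    for path in save_paths:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    created = [os.path.isdir(path) for path in save_paths]

print(created)
```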

### 2.2 Read the list and copy the images

In [11]:
def get_images_name(source_path, save_path, N):
    """
    input: path of the txt list (str), destination folder (str), expected number of images (int)
    output: None; progress is printed
    """
    counts = 0

    with open(source_path, 'r') as f:
        images_list = []
        for i in f.readlines():
            # The second space-separated field of each row is image_id
            images_list.append(i.strip().split(' ')[1] + '.jpg')
        # Skip the header row
        images_list = images_list[1:]

    for images_name in images_list:
        images_id = os.path.join(image_file, images_name)
        #if images_name in os.listdir(image_file):
        shutil.copy(images_id, save_path)
        counts += 1
        if counts % 1000 == 0:
            print('%d images have copied over'%counts)
        if counts == N:
            print('*'*20)
            print('copy all images over')
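A more defensive variant of the copy loop could record missing files instead of stopping at the first one. This is a hedged sketch, not the notebook's method: `copy_listed_images` is a hypothetical helper, and the demo uses a temporary folder with one empty placeholder file.

```python
import os
import shutil
import tempfile

def copy_listed_images(image_names, image_dir, save_dir):
    """Copy the named files into save_dir; return (copied, missing) name lists."""
    os.makedirs(save_dir, exist_ok=True)
    copied, missing = [], []
    for name in image_names:
        src = os.path.join(image_dir, name)
        if os.path.isfile(src):
            shutil.copy(src, save_dir)
            copied.append(name)
        else:
            # Record the name instead of letting shutil.copy raise
            missing.append(name)
    return copied, missing

with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, 'images')
    dst_dir = os.path.join(tmp, 'original_image')
    os.makedirs(src_dir)
    open(os.path.join(src_dir, '1.jpg'), 'w').close()
    copied, missing = copy_listed_images(['1.jpg', '2.jpg'], src_dir, dst_dir)
    saved = sorted(os.listdir(dst_dir))

print(copied, missing, saved)
```

Returning the missing names makes it possible to update a list such as `list_error` in one pass rather than rerunning the copy after each failure.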

In [12]:
get_images_name(source_path1, save_path1, 5000)

1000 images have copied over
2000 images have copied over
3000 images have copied over
4000 images have copied over
5000 images have copied over
********************
copy all images over


In [25]:
get_images_name(source_path2, save_path2, 10000)

1000 images have copied over
2000 images have copied over
3000 images have copied over
4000 images have copied over
5000 images have copied over
6000 images have copied over
7000 images have copied over
8000 images have copied over
9000 images have copied over
10000 images have copied over
********************
copy all images over


In [13]:
get_images_name(source_path3, save_path3, 50000)

1000 images have copied over
2000 images have copied over
3000 images have copied over
4000 images have copied over
5000 images have copied over
6000 images have copied over
7000 images have copied over
8000 images have copied over
9000 images have copied over
10000 images have copied over
11000 images have copied over
12000 images have copied over
13000 images have copied over
14000 images have copied over
15000 images have copied over
16000 images have copied over
17000 images have copied over
18000 images have copied over
19000 images have copied over
20000 images have copied over
21000 images have copied over
22000 images have copied over
23000 images have copied over
24000 images have copied over
25000 images have copied over
26000 images have copied over
27000 images have copied over
28000 images have copied over
29000 images have copied over
30000 images have copied over
31000 images have copied over
32000 images have copied over
33000 images have copied over
34000 images have c

## 3 Generate the three maps: pspnet, maskrcnn, and saliency

## 4 Fuse the maps, locate the object centers, then compute and save the per-image information

## 5 Update and validate the probabilities according to the image distribution