[Discussion] In the end, we have to add data again. #20

yehyunsuh · 2022-04-15T00:00:42Z

What

Gyubeom's post (https://stages.ai/competitions/184/discussion/talk/post/1269) seems to do nothing more than select the Korean files from ICDAR17. The ICDAR17 files do not appear to contain any additional data.

Why

This screenshot shows the number of images in IDCAR17_MLT/images after following the same steps as #17: downloading the MLT files, unzipping them, and running convert_mlp.py.

This screenshot shows the number of images in the existing IDCAR17_Korean/images folder.

That is why I told Kyungmin on Slack yesterday that running convert_mlp.py changes nothing. Lines 50-51 of convert_mlp.py, inside class MLT17Dataset(), read as follows:

if 'ko' not in extra_info['languages'] or extra_info['languages'].difference({'ko', 'en'}):
    continue

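These lines skip every sample that has no Korean text or that contains any language other than Korean and English. In other words, the post Gyubeom uploaded to the aistages board yesterday simply filters the Korean files out of the MLT files.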
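To make the filter condition concrete, here is a minimal standalone sketch of the same check. The function name `is_korean_sample` is hypothetical and not part of convert_mlp.py; only the condition itself is taken from the lines quoted above.

```python
def is_korean_sample(languages):
    """Return True when a sample would survive the filter quoted above."""
    languages = set(languages)
    # Skip when Korean is absent, or when any language other than
    # 'ko' and 'en' is present in the sample.
    skip = 'ko' not in languages or bool(languages.difference({'ko', 'en'}))
    return not skip


print(is_korean_sample({'ko'}))        # True
print(is_korean_sample({'ko', 'en'}))  # True
print(is_korean_sample({'en'}))        # False
print(is_korean_sample({'ko', 'ja'}))  # False
```

Because the check only discards samples, it can never produce more images than the source archive already contains.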

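This means that, as mentioned in #11, we have to redo the process of merging the additional data provided by Boostcamp into our images folder and merging the annotation json files. From what I have tried so far, the json files cannot simply be concatenated. Instead, as in the picture below,

the entries of each json file have to be merged under the images key. I will work on this today.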
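A minimal sketch of that merge, assuming both annotation files share the layout used by the dataloader in this thread (`anno['images'][image_fname]['words']`). The function name `merge_annotations` and the sample file names are hypothetical.

```python
def merge_annotations(base, extra):
    """Merge two annotation dicts by combining their 'images' entries."""
    merged = {'images': dict(base.get('images', {}))}
    for fname, info in extra.get('images', {}).items():
        # Refuse to overwrite silently if both sets use the same file name.
        if fname in merged['images']:
            raise ValueError(f'duplicate image name: {fname}')
        merged['images'][fname] = info
    return merged


# Tiny in-memory example; real files would be read with json.load().
icdar = {'images': {'img_1.jpg': {'words': {}}}}
camper = {'images': {'camper_1.jpg': {'words': {}}}}
merged = merge_annotations(icdar, camper)
print(sorted(merged['images']))  # ['camper_1.jpg', 'img_1.jpg']
```

Raising on duplicate names is a design choice for this sketch; renaming the incoming files would be another option.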

Additionally, I will try splitting the ICDAR Korean data into train and val sets.

How

Try a train/val split on ICDAR Korean - [Features] ICDAR17_MLT dataset train/val split #21
Apply the dataset provided by Boostcamp
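One way the train/val split could look, as a sketch only. It assumes the annotation layout `{'images': {file_name: ...}}` seen elsewhere in this thread; the function name `split_annotations`, the 20% validation ratio, and the seed are all assumptions, not the code that was eventually merged.

```python
import random


def split_annotations(anno, val_ratio=0.2, seed=42):
    """Split an annotation dict into train and val dicts by image name."""
    fnames = sorted(anno['images'])          # sort first for reproducibility
    random.Random(seed).shuffle(fnames)      # seeded shuffle, no global state
    n_val = int(len(fnames) * val_ratio)
    val_names = set(fnames[:n_val])
    train = {'images': {f: anno['images'][f] for f in fnames if f not in val_names}}
    val = {'images': {f: anno['images'][f] for f in fnames if f in val_names}}
    return train, val


anno = {'images': {f'img_{i}.jpg': {'words': {}} for i in range(10)}}
train, val = split_annotations(anno)
print(len(train['images']), len(val['images']))  # 8 2
```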

yehyunsuh · 2022-04-17T04:14:38Z

I keep getting the error shown above and am still working out how to fix it. I think the cause is most similar to what is described in this GitHub issue: https://github.com/airctic/icevision/issues/365

If I cannot resolve this issue by Sunday (the 17th), I will close it and move on to another issue.

yehyunsuh · 2022-04-17T07:00:30Z

Resolved the problem described in the comment above. The original code was:

    def __getitem__(self, idx):
        ...
        for word_info in self.anno['images'][image_fname]['words'].values():
            vertices.append(np.array(word_info['points']).flatten())
            labels.append(int(not word_info['illegibility']))

I changed how points is defined:

    def __getitem__(self, idx):
        ...
        for word_info in self.anno['images'][image_fname]['words'].values():
            points = np.array(word_info['points']).flatten()
            # Keep only the first 8 values, i.e. the first 4 (x, y) points.
            points = points[:8]
            vertices.append(points)
            labels.append(int(not word_info['illegibility']))

Reason for the error
The baseline code only accepts 4 points per bounding box, but the data provided by Boostcamp contains bounding boxes with more than 4 points, as in the picture below.

Given an input like this, the code raises the error shown above. So I had to change the dataloader's __getitem__() to keep only the first 8 values (4 points) of each bounding box.
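The effect of the slice can be seen on a small example. This sketch uses plain Python lists in place of `np.array(...).flatten()[:8]`; the helper name `first_four_points` and the sample polygon are made up for illustration.

```python
def first_four_points(points):
    """Flatten [[x, y], ...] and keep the first 8 values (4 points)."""
    flat = [coord for point in points for coord in point]
    return flat[:8]


# A hypothetical 6-point polygon, like the multi-point boxes described above.
polygon = [[0, 0], [10, 0], [12, 5], [10, 10], [0, 10], [-2, 5]]
print(first_four_points(polygon))  # [0, 0, 10, 0, 12, 5, 10, 10]
```

Note that the remaining vertices are simply dropped, so the resulting quadrilateral may not cover the whole original polygon.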

Comment close.

yehyunsuh added the help wanted Extra attention is needed label Apr 15, 2022

yehyunsuh self-assigned this Apr 15, 2022

yehyunsuh added this to To do in Data Annotation via automation Apr 15, 2022

yehyunsuh moved this from To do to In progress in Data Annotation Apr 15, 2022

yehyunsuh closed this as completed Apr 17, 2022

Data Annotation automation moved this from In progress to Done Apr 17, 2022

This was referenced Apr 18, 2022

[Experiment] Change camper data points #29

Closed

[Feat] add train/valid split code #34

Merged
